Difference between revisions of "Users guide and notes Jacob"
Revision as of 07:31, 26 August 2008
These are my notes for making the English-Esperanto translator, but they might be useful to people like me who know next to nothing about linguistics.
I've installed the standard Ubuntu packages and they're working fine.
Using Apertium
$ echo "Jeg vil gå en tur" | apertium da-sv
Jag vill gå en tur
or
$ echo "Jeg vil gå en tur" | apertium -d apertium-sv-da da-sv
Jag vill gå en tur
Don't use the command apertium-translator; it's old and deprecated!
How to add a missing word
You will need to add the word to both the source-language monodix AND the translation dictionary.
Example: I want to add "treeview", which is an English noun.
First I check whether it's in the English monodix, apertium-eo-en.en.dix. If it isn't, we'll need to add it.
To add it, we first need to find the regular noun paradigm in English. The paradigm is 'house__n'. Why 'house'? Just because it's a memorable example.
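For illustration, a monodix entry using that paradigm might look like the sketch below. It assumes that apertium-eo-en.en.dix really defines a paradigm named house__n whose endings attach directly to the stem (nothing for the singular, "s" for the plural); check the pardefs section of your copy before relying on it.

```xml
<!-- Sketch of an entry for the main section of apertium-eo-en.en.dix. -->
<!-- lm is the lemma; <i> holds the stem that the paradigm endings attach to. -->
<!-- Assumption: a paradigm named house__n exists in this file's <pardefs>. -->
<e lm="treeview"><i>treeview</i><par n="house__n"/></e>
```

If the paradigm behaves as assumed, this single entry should let the analyser recognise both "treeview" and "treeviews".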
Understanding the files
<e r="LR">
  <l>kataluno</l><r>Catalan</r>
</e>
<e r="LR">
  <l>kataluno</l><r>Catalan</r>
</e>
<e r="RL">
  <l>kataluno</l><r>Catalan</r>
</e>
<e r="LR">
  <l>katoliko</l><r>Catholic</r>
</e>
<e r="LR">
  <l>katoliko</l><r>Catholic</r>
</e>
<e r="RL">
  <l>katoliko</l><r>Catholic</r>
</e>
13.04 It's all the same!
francis.tyers: yep, that says: translating left-to-right: katoliko<n><f> → Catholic<n> and katoliko<n><m> → Catholic<n>
13.05 mig: You could write "katalunino" to say a female Catalan person, but most people wouldn't care and would write "kataluno"
francis.tyers: translating from right-to-left, Catholic<n> → katoliko<n><GD> (GD = gender to be determined)
mig: ah, LR = left-to-right, the directions
13.06 francis.tyers: yeah
mig: I have understood
francis.tyers: left-to-right = Esperanto to English
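The entries listed at the start of this section are shown without their symbol tags, which is why they look identical. Going by the explanation in the chat, the three katoliko entries probably look something like the sketch below. The `<p>` wrapper and the `<s n="..."/>` elements follow the usual Apertium dictionary format; the tag names (n, f, m, GD) are taken from the chat and have not been checked against the actual file.

```xml
<!-- Left-to-right only (Esperanto to English): both genders map to the same English noun. -->
<e r="LR"><p><l>katoliko<s n="n"/><s n="f"/></l><r>Catholic<s n="n"/></r></p></e>
<e r="LR"><p><l>katoliko<s n="n"/><s n="m"/></l><r>Catholic<s n="n"/></r></p></e>
<!-- Right-to-left only (English to Esperanto): gender is left "to be determined" (GD). -->
<e r="RL"><p><l>katoliko<s n="n"/><s n="GD"/></l><r>Catholic<s n="n"/></r></p></e>
```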
Words
which is the lemma ?
13.08 mig: what is "lemma" ? (a lemming? :-)
francis.tyers: haha! no, the base form, root form, citation form, etc.
Why words also need to be in the monolingual dictionary
treeview is not in the English dictionary
mig: ah, couldn't it just suppose it to be a noun, then? :-)
13.53 francis.tyers: nope
mig: or take it from the apertium-eo-en.eo.dix
francis.tyers: everything to be translated needs to be in the analyser
  how would it know the number? how would it know treeview is singular and treeviews is plural?
13.54 it could guess, but then how would it be able to distinguish between "to treeview" and "he treeviews" (which don't exist)
mig: so I also need to add the word to apertium-eo-en.en.dix.
and it has the same declination as all other verbs ?
mig: infinitive no, it's quite skew :-) declination= ?
13.10 francis.tyers: conjugation
mig: I promise to learn the linguistic words within the week.
francis.tyers: haha :D an idea
mig: yes, all declinations (ways of conjugation) are all the same in Esperanto
#include clause
Q: In general, is there a way to do something like an #include clause so that I could keep my additions separate from the rest?
A: See apertium-en-es/apertium-en-es.en.metadix.xml:
<?xml version="1.0" encoding="UTF-8"?>
<dictionary>
  <alphabet>·ÀÁÂÄÇÈÉÊËÌÍÎÏÑÒÓÔÖÙÚÛÜàáâäçèéêëìíîïñòóôöùúûüABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz</alphabet>
        <!-- symbols -->
        <xi:include xmlns:xi="http://www.w3.org/2001/XInclude"  href="apertium-en-es.symbols.xml"/>
        <!-- paradigms -->
        <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="apertium-en-es.en.pardefs.xml"/>
And then in apertium-en-es.symbols.xml:
<?xml version="1.0" encoding="UTF-8"?>
  <sdefs>
    <sdef n="comp" />
    <sdef n="detnt" />
    <sdef n="predet" />
    <sdef n="past" />
    <sdef n="atn" />
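Applied to the question above, one possible layout is to keep your own entries in a separate file whose root element is a complete `<section>`, and to pull that file into the main dictionary with xi:include. This is only a sketch: the file name apertium-eo-en.en.additions.xml and the section id are made up for illustration, the house__n paradigm is assumed to exist, and the includes are assumed to be expanded by an XInclude-aware tool (for example xmllint --xinclude) before the dictionary is compiled.

```xml
<!-- In the main dictionary, next to the existing sections (hypothetical file name): -->
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="apertium-eo-en.en.additions.xml"/>

<!-- Whole contents of the hypothetical apertium-eo-en.en.additions.xml: -->
<section id="additions" type="standard">
  <e lm="treeview"><i>treeview</i><par n="house__n"/></e>
</section>
```

Using a whole `<section>` as the root keeps the included file well-formed on its own, in the same way that the symbols file above has a single `<sdefs>` root.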
TODO
- go through http://wiki.apertium.org/wiki/Monodix_basics and review the file (the apertium-eo-en.eo.dix file)
- add treeview (and others added to apertium-eo-en.eo-en.dix) to the English monodix
- make some wiki notes.
File from traduku.net
convert it into EN : EO
  then tag the EO side
  and strip out the nouns and adjectives
  those are most important to start with
  then grab a corpus
  (Wikipedia, or Europarl or something)
22.15 and order them by frequency of the English word
mig: why reorder?
francis.tyers@gmail.com: higher frequency words are more important
22.16 if you translate "the" correctly, you cover ~50% of the text, if you translate "gable" correctly you cover maybe 0.5%
22.17 mig: yes, yes, but why bother if all words get in?
francis.tyers@gmail.com: because someone has to add the inflection for the English side
  the Esperanto side is regular, but the English is not always regular
22.18 mig: OK, so reordering is important because we probably won't make all 110000.
francis.tyers@gmail.com: yep
  but the good news is we don't need to make 110000
  we have 93% coverage with ~7,000 words
  so we can get 99% coverage with probably 20,000

