Users guide and notes Jacob

From Apertium
Revision as of 07:31, 26 August 2008

These are my notes from making the English-Esperanto translator, but they might be useful to people like me who know next to nothing about linguistics.

I've installed the standard Ubuntu packages and they're working fine.

Using Apertium

$ echo "Jeg vil gå en tur" | apertium da-sv
Jag vill gå en tur

or

$ echo "Jeg vil gå en tur" | apertium -d apertium-sv-da da-sv
Jag vill gå en tur


Don't use the command apertium-translator; it's old and deprecated!


How to add a missing word

You will need to add the word to both the source-language monodix AND the translation dictionary.

Example: I want to add "treeview" which is an English noun.

First I check whether it's in the English monodix apertium-eo-en.en.dix. If it isn't, we'll need to add it.

Next we need to find the regular noun paradigm in English. The paradigm is 'house__n'. Why 'house'? Just because it's a memorable example.
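
As a rough sketch of what the new monodix entry looks like, the snippet below builds one with Python's standard library. The helper name make_noun_entry is made up for illustration, and it assumes the 'house__n' paradigm only appends endings to the full lemma, so the whole word goes in the <i> element. Check the pardef in the real file before relying on that.

```python
import xml.etree.ElementTree as ET


def make_noun_entry(lemma: str, paradigm: str = "house__n") -> ET.Element:
    """Build a monodix <e> entry for a regular noun (hypothetical helper)."""
    entry = ET.Element("e", {"lm": lemma})
    stem = ET.SubElement(entry, "i")
    # Assumption: the paradigm adds endings to the whole lemma.
    stem.text = lemma
    ET.SubElement(entry, "par", {"n": paradigm})
    return entry


entry = make_noun_entry("treeview")
# Prints the entry as a single line of XML, ready to paste into the main section.
print(ET.tostring(entry, encoding="unicode"))
```

The printed line goes into the main section of apertium-eo-en.en.dix next to the other nouns.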


Understanding the files

<e r="LR">
<l>kataluno</l><r>Catalan</r>
</e>
<e r="LR">
<l>kataluno</l><r>Catalan</r>
</e>
<e r="RL">
<l>kataluno</l><r>Catalan</r>
</e>
<e r="LR">
<l>katoliko</l><r>Catholic</r>
</e>
<e r="LR">
<l>katoliko</l><r>Catholic</r>
</e>
<e r="RL">
<l>katoliko</l><r>Catholic</r>
</e>

13.04 It's all the same!

francis.tyers: yep
 that says:
 translating left-to-right: katoliko<n><f> → Catholic<n>
 and
 katoliko<n><m> → Catholic<n>
 

13.05 mig: You could write "katalunino" to say a female Catalan person, but most people wouldn't care and would write "kataluno"

francis.tyers: translating from right-to-left, Catholic<n> → katoliko<n><GD> (GD = gender to be determined)
mig: ah
 LR = left-to-right
 the directions

13.06 francis.tyers: yeah

mig: I have understood
francis.tyers: left-to-right = esperanto to english
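
To make the direction restrictions concrete, here is a small sketch that picks out which entries apply in each direction. It uses the katoliko entries exactly as they appear above; the <entries> wrapper and the function name usable_in are only there for the example. An entry with no r attribute would apply in both directions.

```python
import xml.etree.ElementTree as ET

# The three katoliko entries from the notes, wrapped in a made-up
# <entries> element so the string parses as one XML document.
ENTRIES = """
<entries>
  <e r="LR"><l>katoliko</l><r>Catholic</r></e>
  <e r="LR"><l>katoliko</l><r>Catholic</r></e>
  <e r="RL"><l>katoliko</l><r>Catholic</r></e>
</entries>
"""


def usable_in(entry: ET.Element, direction: str) -> bool:
    """True if the entry may be used when translating in `direction`."""
    restriction = entry.get("r")
    # No restriction means the entry works both ways.
    return restriction is None or restriction == direction


root = ET.fromstring(ENTRIES)
lr_entries = [e for e in root.iter("e") if usable_in(e, "LR")]  # eo -> en
rl_entries = [e for e in root.iter("e") if usable_in(e, "RL")]  # en -> eo
print(len(lr_entries), len(rl_entries))  # prints: 2 1
```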


Words

which is the lemma?

13.08 mig: what is "lemma"? (a lemming? :-)

francis.tyers: haha!
 no
 the base form
 root form
 citation form
 etc.


Why words also need to be in the monolingual dictionary

treeview is not in the English dictionary

mig: ah
 couldn't it just suppose it to be a noun, then :-)

13.53 francis.tyers: nope

mig: or take it from the apertium-eo-en.eo.dix
francis.tyers: everything to be translated needs to be in the analyser
 how would it know the number ?
 how would it know treeview is singular and treeviews is plural ?

13.54 it could guess, but then how would it be able to distinguish between "to treeview" and "he treeviews" (which don't exist)

mig: so I also need to add the word to apertium-eo-en.en.dix.
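
A toy illustration of the point above: the analyser can only tell "treeview" (singular) from "treeviews" (plural) once the lemma is listed with a paradigm. Everything here is a simplification for the example. The NOUN_PARADIGM table, build_analyser and analyse are made-up names, and the real analyser is compiled from the .dix file rather than built like this. The "*" prefix mimics how Apertium marks unknown words.

```python
# Toy stand-in for a regular English noun paradigm: ending -> tags.
NOUN_PARADIGM = {"": "<n><sg>", "s": "<n><pl>"}


def build_analyser(lemmas):
    """Expand every lemma with the paradigm into a surface-form lookup table."""
    table = {}
    for lemma in lemmas:
        for suffix, tags in NOUN_PARADIGM.items():
            table.setdefault(lemma + suffix, []).append(lemma + tags)
    return table


def analyse(table, word):
    """Return the known analyses, or the word marked as unknown."""
    return table.get(word, ["*" + word])


before = build_analyser(["house"])
after = build_analyser(["house", "treeview"])
print(analyse(before, "treeviews"))  # prints: ['*treeviews']
print(analyse(after, "treeviews"))   # prints: ['treeview<n><pl>']
```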

and it has the same declination as all other verbs ?

mig: infinitive
 no, it's quite skew :-)
 declination= ?

13.10 francis.tyers: conjugation

mig: I promise to learn the linguistic words within the week.
francis.tyers: haha :D
 an idea
mig: yes, all declinations (ways of conjugation) are all the same in Esperanto


#include clause

Q: In general, is there a way to do something like an #include clause so that I could keep my additions separate from the rest?

A: See apertium-en-es/apertium-en-es.en.metadix.xml:

<?xml version="1.0" encoding="UTF-8"?>

<dictionary>
  <alphabet>·ÀÁÂÄÇÈÉÊËÌÍÎÏÑÒÓÔÖÙÚÛÜàáâäçèéêëìíîïñòóôöùúûüABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz</alphabet>

  <!-- symbols -->
  <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="apertium-en-es.symbols.xml"/>

  <!-- paradigms -->
  <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="apertium-en-es.en.pardefs.xml"/>

And then in apertium-en-es.symbols.xml:

<?xml version="1.0" encoding="UTF-8"?>

  <sdefs>
    <sdef n="comp" />
    <sdef n="detnt" />
    <sdef n="predet" />
    <sdef n="past" />
    <sdef n="atn" />
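
To see what the xi:include lines do, the sketch below expands one with Python's xml.etree.ElementInclude. This is only meant to show the XInclude mechanism and is not necessarily how the Apertium build processes the metadix. The in-memory FILES table and the loader function stand in for real files on disk, and the dictionary is cut down to a single include.

```python
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

# In-memory stand-in for the included file (closed here so it parses).
FILES = {
    "apertium-en-es.symbols.xml": (
        '<sdefs><sdef n="comp"/><sdef n="detnt"/><sdef n="predet"/>'
        '<sdef n="past"/><sdef n="atn"/></sdefs>'
    ),
}

MAIN = (
    "<dictionary>"
    '<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" '
    'href="apertium-en-es.symbols.xml"/>'
    "</dictionary>"
)


def loader(href, parse, encoding=None):
    """Resolve an include from FILES instead of the filesystem."""
    return ET.fromstring(FILES[href])


root = ET.fromstring(MAIN)
# Replaces each xi:include element with the root of the included document.
ElementInclude.include(root, loader=loader)
symbols = [s.get("n") for s in root.iter("sdef")]
print(symbols)
```

After expansion the dictionary contains the <sdefs> block directly, so your own additions can live in a separate file and still end up in one dictionary.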


TODO

- go through http://wiki.apertium.org/wiki/Monodix_basics and review the file (the apertium-eo-en.eo.dix file)
- add treeview (and others added to apertium-eo-en.eo-en.dix) to the English monodix 
- make some wiki notes.


File from traduku.net

convert it into EN : EO

 then tag the EO side
 and strip out the nouns and adjectives
 those are most important to start with
 then grab a corpus
 (wikipedia, or euro parl or something)

22.15 and order them by frequency of the English word

mig: why reorder?
francis.tyers@gmail.com: higher frequency words are more important

22.16 if you translate "the" correctly, you cover ~50% of the text, if you translate "gable" correctly you cover maybe 0.5%

22.17 mig: yes, yes, but why bother if all words get in?

francis.tyers@gmail.com: because someone has to add the inflection for the English side
 the Esperanto side is regular, but the English is not always regular

22.18 mig: OK, so reordering is important because we probably won't make all 110000.

francis.tyers@gmail.com: yep
 but the good news is we don't need to make 110000
 we have 93% coverage with ~7,000 words
 so we can get 99% coverage with probably 20,000
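
The ordering and coverage idea from the chat can be sketched in a few lines. The corpus here is a made-up toy sentence and the function names are invented for the example. A real run would count words over Wikipedia or Europarl text and use the word list from traduku.net.

```python
from collections import Counter


def order_by_frequency(candidates, corpus_tokens):
    """Sort candidate words so the most frequent in the corpus come first."""
    counts = Counter(corpus_tokens)
    return sorted(candidates, key=lambda word: -counts[word])


def coverage(known_words, corpus_tokens):
    """Fraction of corpus tokens that are in the known word list."""
    known = set(known_words)
    hits = sum(1 for token in corpus_tokens if token in known)
    return hits / len(corpus_tokens)


# Toy corpus, an assumption for illustration only.
tokens = "the cat saw the gable and the cat sat".split()
ordered = order_by_frequency(["gable", "cat", "the"], tokens)
print(ordered)  # prints: ['the', 'cat', 'gable']

# Coverage after adding the words one by one, most frequent first.
for n in range(1, len(ordered) + 1):
    print(ordered[:n], round(coverage(ordered[:n], tokens), 2))
```

Adding words in this order gives the biggest coverage gain for each inflection that has to be entered by hand.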