Difference between revisions of "Users guide and notes Jacob"

Latest revision as of 05:45, 29 December 2011

1 Using Apertium
2 How to add a missing word
- 2.1 Understanding the files
3 Why words also need to be in the monolingual dictionary
4 Why do pairs with the same language (e.g. English) not share the English monodix?
5 #include clause
- 5.1 Using <xi:include/>
- 5.2 Using shell tools, like cat, head and tail
6 TODO
- 6.1 File from traduku.net

These are my notes from making the English-Esperanto translator, but they might be useful to people like me who know next to nothing about linguistics.

I've installed the standard Ubuntu packages and they're working fine:

Using Apertium

$ echo "Jeg vil gå en tur" | apertium da-sv
Jag vill gå en tur

or

$ echo "Jeg vil gå en tur" | apertium -d apertium-sv-da da-sv
Jag vill gå en tur

Don't use the apertium-translator command; it's old and deprecated!

How to add a missing word

You need to add the word to both the source-language monodix AND the translation (bilingual) dictionary.

Example: I want to add "treeview", which is an English noun.

First I check whether it's in the English monodix, apertium-eo-en.en.dix. If it isn't, we'll need to add it.

Next we need to find the regular noun paradigm in English. The paradigm is 'house__n'. Why 'house'? Just because it's a memorable example.
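The check-then-add step above can be sketched in shell. This is only a sketch under assumptions: the miniature dictionary is a made-up stand-in for apertium-eo-en.en.dix, the helper name noun_entry is hypothetical, and the entry layout (lm attribute, <i> stem, <par> reference) is the usual monodix form as I understand it, so compare it with existing entries in the real file before pasting anything.

```shell
# Hypothetical miniature monodix; the real file is apertium-eo-en.en.dix.
dix=$(mktemp)
cat > "$dix" <<'EOF'
<dictionary>
  <section id="main" type="standard">
    <e lm="house"><i>house</i><par n="house__n"/></e>
  </section>
</dictionary>
EOF

# Print a monodix entry for a regular noun, reusing the house__n paradigm.
noun_entry() {
  printf '<e lm="%s"><i>%s</i><par n="house__n"/></e>\n' "$1" "$1"
}

word=treeview
if grep -q "lm=\"$word\"" "$dix"; then
  echo "$word is already in the monodix"
else
  # Not found: this is the line to add inside the <section> of the monodix.
  noun_entry "$word"
fi
```

The word still has to be added to the bilingual dictionary as well, using entries of the kind shown under "Understanding the files".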

Understanding the files

<e r="LR"><p><l>kataluno<s n="n"/><s n="f"/></l><r>Catalan<s n="n"/></r></p></e>
<e r="LR"><p><l>kataluno<s n="n"/><s n="m"/></l><r>Catalan<s n="n"/></r></p></e>
<e r="RL"><p><l>kataluno<s n="n"/><s n="GD"/></l><r>Catalan<s n="n"/></r></p></e>

<e r="LR"><p><l>katoliko<s n="n"/><s n="f"/></l><r>Catholic<s n="n"/></r></p></e>
<e r="LR"><p><l>katoliko<s n="n"/><s n="m"/></l><r>Catholic<s n="n"/></r></p></e>
<e r="RL"><p><l>katoliko<s n="n"/><s n="GD"/></l><r>Catholic<s n="n"/></r></p></e>
13.04 mig: It's all the same!
 francis.tyers: yep
  that says:
  translating left-to-right: katoliko<n><f> → Catholic<n>
  and
  katoliko<n><m> → Catholic<n>
  
13.05 mig: You could write "katalunino" to say a female Catalan person, but most people wouldn't care and would write "kataluno"
 francis.tyers: translating from right-to-left, Catholic<n> → katoliko<n><GD> (GD = gender to be determined)
 mig: ah
  LR = left-to-right
  the directions
13.06 francis.tyers: yeah
 mig: I have understood
 francis.tyers: left-to-right = esperanto to english
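To make the direction restrictions concrete, here is a small sketch that counts which of the katoliko entries above take part in each direction. It assumes (as far as I understand the format) that an entry marked r="LR" is used only left-to-right, one marked r="RL" only right-to-left, and an entry with no r attribute in both directions. The function names are made up for the example.

```shell
# The three katoliko entries from the bilingual dictionary, one per line.
bidix=$(mktemp)
cat > "$bidix" <<'EOF'
<e r="LR"><p><l>katoliko<s n="n"/><s n="f"/></l><r>Catholic<s n="n"/></r></p></e>
<e r="LR"><p><l>katoliko<s n="n"/><s n="m"/></l><r>Catholic<s n="n"/></r></p></e>
<e r="RL"><p><l>katoliko<s n="n"/><s n="GD"/></l><r>Catholic<s n="n"/></r></p></e>
EOF

# Entries used left-to-right (Esperanto to English): everything not marked RL.
count_lr() { grep -vc 'r="RL"' "$1"; }

# Entries used right-to-left (English to Esperanto): everything not marked LR.
count_rl() { grep -vc 'r="LR"' "$1"; }

echo "eo -> en entries: $(count_lr "$bidix")"
echo "en -> eo entries: $(count_rl "$bidix")"
```

So both gendered forms collapse to Catholic<n> going into English, and a single entry with GD covers the way back.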

Why words also need to be in the monolingual dictionary

treeview is not in the English dictionary

mig: ah
 couldn't it just suppose it to be a noun, then :-)

13.53 francis.tyers: nope

mig: or take it from the apertium-eo-en.eo.dix
francis.tyers: everything to be translated needs to be in the analyser
 how would it know the number ?
 how would it know treeview is singular and treeviews is plural ?

13.54 it could guess, but then how would it be able to distinguish between "to treeview" and "he treeviews" (which don't exist)

mig: so I need also to add the word to apertium-eo-en.en.dix.

and it has the same declination as all other verbs ?

mig: infinitive
 no, it's quite skew :-)
 declination= ?

13.10 francis.tyers: conjugation

mig: I promise to learn the linguistic words within the week.
francis.tyers: haha :D
 an idea
mig: yes, all declinations (ways of conjugation) are all the same in Esperanto
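The point about number from the chat above ("how would it know treeview is singular and treeviews is plural?") can be illustrated with a toy sketch. It only imitates what a regular noun paradigm like house__n encodes; it is not how the Apertium analyser is implemented, the function names are invented, and the surface:analysis output format is just loosely modelled on what the tools print.

```shell
# Toy imitation of a regular English noun paradigm:
# one dictionary entry yields a singular and a plural analysis.
expand_noun() {
  printf '%s:%s<n><sg>\n' "$1" "$1"
  printf '%ss:%s<n><pl>\n' "$1" "$1"
}

# Look up a surface form ($1) among the expansions of a lemma ($2).
# Forms that are not listed come back marked with * as unknown.
analyse() {
  expand_noun "$2" | grep "^$1:" || echo "$1:*$1"
}

expand_noun treeview
analyse treeviews treeview
analyse treeviewed treeview
```

Without the monodix entry there is nothing to expand, so every form of the word would come out unknown.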

Why do pairs with the same language (e.g. English) not share the English monodix?

> Why can't, for example, en-ca, en-es and en-eo all share the SAME English dictionary?
> Then we could all contribute to this giant dict for the advantage of all 3 projects?

Within each project, a word that is added to one dictionary has to be added to all of them. For example, if you want to add a word to es-en, you need to add it to all three dictionaries (en, en-es, es) -- in the appropriate form. Otherwise you get the @ # * symbols in the output.

Because of this, and because not everyone has the time to edit, or speaks all of the languages, we find it more convenient to work with them separately, as language pairs, and then merge when/where possible. You'll note that most of the paradigm names, for example, are shared.

Although the ideal is for each dictionary to be "isolated", it isn't always like that. For example, there are some things it makes sense to distinguish in some language pairs and not in others.
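For reference, here is a small sketch of what the @ # * symbols mentioned above point to. The meanings are my understanding of the usual Apertium conventions, so treat them as an assumption and check the wiki; explain_marker is a made-up helper, not an Apertium tool.

```shell
# Say which dictionary a marked word in the output points to.
explain_marker() {
  case "$1" in
    '*'*) echo "unknown word: missing from the source-language monodix" ;;
    '@'*) echo "missing from the bilingual dictionary" ;;
    '#'*) echo "cannot be generated: missing from the target-language monodix" ;;
    *)    echo "no marker: translated normally" ;;
  esac
}

for token in '*treeview' '@treeview' '#treeview' 'treeview'; do
  printf '%s\t%s\n' "$token" "$(explain_marker "$token")"
done
```

That is why one missing entry in any of the three dictionaries is enough to leave a symbol in the translation.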

#include clause

Q: In general, is there a way to do something like an #include clause, so that I could keep my additions separate from the rest?

Using <xi:include/>

A: You could use <xi:include/> as is done in apertium-en-es.

There are, however, a lot of limitations, which make it cumbersome:

- The Apertium tools don't support <xi:include/> directly, so the files have to be preprocessed first, and Apertium then works on the preprocessed result rather than on the files you edit.
- You will therefore have to make significant changes to the Makefile.
- The included files need to have an enclosing tag, like <sdefs> below. If there isn't one, you'll have to invent one.

See apertium-en-es/apertium-en-es.en.metadix.xml. Here's how it is done there:

<?xml version="1.0" encoding="UTF-8"?>
<dictionary>
  <alphabet>·ÀÁÂÄÇÈÉÊËÌÍÎÏÑÒÓÔÖÙÚÛÜàáâäçèéêëìíîïñòóôöùúûüABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz</alphabet>

        <!-- symbols -->
        <xi:include xmlns:xi="http://www.w3.org/2001/XInclude"  href="apertium-en-es.symbols.xml"/>

        <!-- paradigms -->
        <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="apertium-en-es.en.pardefs.xml"/>

And then in apertium-en-es.symbols.xml:

<?xml version="1.0" encoding="UTF-8"?>

  <sdefs>
    <sdef n="comp" />
    <sdef n="detnt" />
    <sdef n="predet" />
    <sdef n="past" />
    <sdef n="atn" />
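The preprocessing step could look something like the sketch below. It assumes xmllint (from libxml2) is installed and uses its --xinclude option; I have not checked that this is what the apertium-en-es Makefile actually runs, and the file names and tiny contents are invented for the example. The processor may also add xml:base attributes to the included elements, which the dictionary validator might not accept, so check the expanded file before compiling it.

```shell
work=$(mktemp -d)

# Included file: needs a single enclosing tag, here <sdefs>.
cat > "$work/symbols.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<sdefs>
  <sdef n="n"/>
  <sdef n="sg"/>
</sdefs>
EOF

# "Meta" dictionary that pulls the symbols in with xi:include.
cat > "$work/meta.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<dictionary>
  <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="symbols.xml"/>
</dictionary>
EOF

if command -v xmllint >/dev/null 2>&1; then
  # Expand the includes; the result is what the Apertium tools would then read.
  xmllint --xinclude "$work/meta.xml" > "$work/expanded.dix"
  cat "$work/expanded.dix"
else
  echo "xmllint is not installed; skipping the expansion" >&2
fi
```

In a Makefile this would become one extra rule that builds the .dix from the metadix before validation and lt-comp.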

Using shell tools, like cat, head and tail

If you just want something simple like #include in C/C++, it might be much easier to just use the Unix shell commands cat, head and tail. Imagine that you want to add your file near the end of each dictionary, just before the last three lines:

  #include your file here
  
  </section>
</dictionary>

then you could change your Makefile like this (the original lines are kept, commented out with #):


$(PREFIX1).automorf.bin: $(BASENAME).$(LANG1).dix tradukunet.$(LANG1).dix
        (head -n -3 $(BASENAME).$(LANG1).dix; cat tradukunet.$(LANG1).dix; tail -n -3 $(BASENAME).$(LANG1).dix) > tmp.dix
        apertium-validate-dictionary tmp.dix
        lt-comp lr tmp.dix $@
#       apertium-validate-dictionary $(BASENAME).$(LANG1).dix
#       lt-comp lr $(BASENAME).$(LANG1).dix $@

$(PREFIX1).autobil.bin: $(BASENAME).$(PREFIX1).dix tradukunet.$(PREFIX1).dix
        (head -n -3 $(BASENAME).$(PREFIX1).dix; cat tradukunet.$(PREFIX1).dix; tail -n -3 $(BASENAME).$(PREFIX1).dix) > tmp.dix
        apertium-validate-dictionary tmp.dix
        lt-comp lr tmp.dix $@
#       apertium-validate-dictionary $(BASENAME).$(PREFIX1).dix
#       lt-comp lr $(BASENAME).$(PREFIX1).dix $@

$(PREFIX1).autogen.bin: $(BASENAME).$(LANG2).dix tradukunet.$(LANG2).dix
        (head -n -3 $(BASENAME).$(LANG2).dix; cat tradukunet.$(LANG2).dix; tail -n -3 $(BASENAME).$(LANG2).dix) > tmp.dix
        apertium-validate-dictionary tmp.dix
        lt-comp rl tmp.dix $@ 
#       apertium-validate-dictionary $(BASENAME).$(LANG2).dix
#       lt-comp rl $(BASENAME).$(LANG2).dix $@

$(PREFIX1).autopgen.bin: $(BASENAME).post-$(LANG2).dix
        apertium-validate-dictionary $(BASENAME).post-$(LANG2).dix
        lt-comp lr $(BASENAME).post-$(LANG2).dix $@

$(PREFIX2).automorf.bin: $(BASENAME).$(LANG2).dix tradukunet.$(LANG2).dix
        (head -n -3 $(BASENAME).$(LANG2).dix; cat tradukunet.$(LANG2).dix; tail -n -3 $(BASENAME).$(LANG2).dix) > tmp.dix
        apertium-validate-dictionary tmp.dix
        lt-comp lr tmp.dix $@
#       apertium-validate-dictionary $(BASENAME).$(LANG2).dix
#       lt-comp lr $(BASENAME).$(LANG2).dix $@

$(PREFIX2).autobil.bin: $(BASENAME).$(PREFIX1).dix  tradukunet.$(PREFIX1).dix
        (head -n -3 $(BASENAME).$(PREFIX1).dix; cat tradukunet.$(PREFIX1).dix; tail -n -3 $(BASENAME).$(PREFIX1).dix) > tmp.dix
        apertium-validate-dictionary tmp.dix
        lt-comp rl tmp.dix $@
#       apertium-validate-dictionary $(BASENAME).$(PREFIX1).dix
#       lt-comp rl $(BASENAME).$(PREFIX1).dix $@

$(PREFIX2).autogen.bin: $(BASENAME).$(LANG1).dix tradukunet.$(LANG1).dix
        (head -n -3 $(BASENAME).$(LANG1).dix; cat tradukunet.$(LANG1).dix; tail -n -3 $(BASENAME).$(LANG1).dix) > tmp.dix
        apertium-validate-dictionary tmp.dix
        lt-comp rl tmp.dix $@
#       apertium-validate-dictionary $(BASENAME).$(LANG1).dix
#       lt-comp rl $(BASENAME).$(LANG1).dix $@

Here the included files are called tradukunet.*.dix.
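To see what the head/cat/tail line in those rules does, here is the same splice on a toy dictionary. The file names and contents are invented, and it assumes GNU head, since "head -n -3" (everything except the last three lines) is not available in every Unix. Note that "tail -n -3" in the Makefile means the same as "tail -n 3".

```shell
work=$(mktemp -d)

# Toy base dictionary whose last three lines are a blank line and the closing tags.
cat > "$work/base.dix" <<'EOF'
<dictionary>
  <section id="main" type="standard">
    <e lm="house"><i>house</i><par n="house__n"/></e>

  </section>
</dictionary>
EOF

# Toy additions file: bare entries, no enclosing tags.
cat > "$work/extra.dix" <<'EOF'
    <e lm="treeview"><i>treeview</i><par n="house__n"/></e>
EOF

# Write the base file with the additions spliced in before its last three lines.
splice() {
  head -n -3 "$1"   # everything except the last three lines (GNU head)
  cat "$2"          # the additions
  tail -n 3 "$1"    # the last three lines
}

splice "$work/base.dix" "$work/extra.dix" > "$work/tmp.dix"
cat "$work/tmp.dix"
```

The additions file therefore must not contain its own <dictionary> or <section> tags, and the base file really has to end with exactly the three lines you expect, otherwise the entries land in the wrong place.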

TODO

- go through http://wiki.apertium.org/wiki/Monodix_basics and review the file (the apertium-eo-en.eo.dix file)
- add treeview (and others added to apertium-eo-en.eo-en.dix) to the English monodix 
- make some wiki notes.

<jacobn> Ok, I'll try the web doc translator more, find the systematics, report a bug and attach files etc.

File from traduku.net

convert it into EN : EO

 then tag the EO side
 and strip out the nouns and adjectives
 those are most important to start with
 then grab a corpus
 (Wikipedia, or Europarl or something)

22.15 and order them by frequency of the english word

mig: why reorder?
francis.tyers@gmail.com: higher frequency words are more important

22.16 if you translate "the" correctly, you cover ~50% of the text, if you translate "gable" correctly you cover maybe 0.5%

22.17 mig: yes, yes, but why bother if all words get in?

francis.tyers@gmail.com: because someone has to add the inflection for the english side
 the esperanto side is regular, but the english is not always regular

22.18 mig: OK, so reordering is important because we probably won't make all 110000.

francis.tyers@gmail.com: yep
 but the good news is we don't need to make 110000
 we have 93% coverage with ~7,000 words
 so we can get 99% coverage with probably 20,000
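The "order them by frequency of the English word" step from the chat above can be sketched with standard shell tools. The two-line corpus and the four-word list are toy stand-ins (a real run would use Wikipedia or Europarl text and the English side of the traduku.net list), and the function names are made up for the example.

```shell
work=$(mktemp -d)

# Toy stand-in for a corpus.
cat > "$work/corpus.txt" <<'EOF'
the house has a gable and the gable is red
the tree is near the house
EOF

# Toy stand-in for the English side of the EN : EO word list.
printf '%s\n' gable house tree treeview > "$work/words.txt"

# One lowercase token per line.
tokens() {
  tr -cs '[:alpha:]' '\n' < "$1" | tr '[:upper:]' '[:lower:]' | grep -v '^$'
}

# Frequency table: "count word", most frequent first.
tokens "$work/corpus.txt" | sort | uniq -c | awk '{print $1, $2}' \
  | sort -k1,1nr -k2,2 > "$work/freq.txt"

# Order the word list by corpus frequency; words absent from the corpus get 0.
by_frequency() {
  awk 'NR == FNR { freq[$2] = $1; next }
       { print (($1 in freq) ? freq[$1] : 0), $1 }' "$1" "$2" \
    | sort -k1,1nr -k2,2
}

by_frequency "$work/freq.txt" "$work/words.txt"
```

With this toy data gable and house come first (2 occurrences each), then tree (1), and treeview last (0), so the words that still need English inflection added by hand can be worked through from the top.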
