Difference between revisions of "User:Unhammer/wishlist"

Revision as of 16:06, 13 March 2010

My wishlist for Apertium features (mostly just useful for language pair developers).

Fallthrough option in transfer

Sometimes you match an input pattern in a rule, e.g. "n vblex", check whether the target-language n has some feature, and do something special with it only if it has that feature. It would be great if we could specify in the <otherwise> that we want to fall through, ignoring the fact that this rule matched.

There are two options for how to "ignore": the best (but possibly slowest?) would be to go on trying to match the rest of the rules; the other is to act as if no rules matched. Either would be an improvement.
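The two fallthrough behaviours can be compared in a small Python sketch. This is only an illustration of the idea, not how apertium-transfer is implemented: the rule representation, the convention that an action returns None to fall through, and the names apply_rules, keep_trying and default_output are all assumptions made up for this example.

```python
def default_output(words):
    """Stand-in for what happens when no rule matches: pass the first word through."""
    return words[0]


def apply_rules(words, tags, rules, keep_trying=True):
    """Apply the first matching rule whose action does not ask to fall through."""
    for pattern, action in rules:
        if tuple(tags[:len(pattern)]) != pattern:
            continue  # pattern does not match the input
        result = action(words[:len(pattern)])
        if result is not None:
            return result
        if not keep_trying:
            break  # option 2: act as if no rule had matched at all
        # option 1: go on trying the remaining rules
    return default_output(words)


def special_n_vblex(words):
    # Hypothetical feature test: only handle nouns marked with "!".
    if words[0].endswith("!"):
        return "SPECIAL(" + " ".join(words) + ")"
    return None  # request fallthrough


def generic_n(words):
    return "N(" + words[0] + ")"


RULES = [(("n", "vblex"), special_n_vblex), (("n",), generic_n)]
```

With keep_trying=True the input falls through to the shorter "n" rule; with keep_trying=False it is treated as unmatched.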

UTF-8 in sdefs

But since these are XML IDs, this may not be possible? (At least not for the first character.)
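If sdef names really are XML IDs, as supposed above, they have to match the XML Name production. The sketch below checks a string against the NameStartChar and NameChar ranges of XML 1.0 (Fifth Edition); whether a given parser or validator follows that edition is a separate question, and the function name is_xml_name is made up for this example.

```python
# Code point ranges of NameStartChar in XML 1.0 (Fifth Edition).
NAME_START_RANGES = [
    (0x3A, 0x3A), (0x41, 0x5A), (0x5F, 0x5F), (0x61, 0x7A),
    (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x2FF), (0x370, 0x37D),
    (0x37F, 0x1FFF), (0x200C, 0x200D), (0x2070, 0x218F),
    (0x2C00, 0x2FEF), (0x3001, 0xD7FF), (0xF900, 0xFDCF),
    (0xFDF0, 0xFFFD), (0x10000, 0xEFFFF),
]
# NameChar additionally allows "-", ".", digits and a few combining ranges.
NAME_EXTRA_RANGES = [
    (0x2D, 0x2E), (0x30, 0x39), (0xB7, 0xB7), (0x300, 0x36F), (0x203F, 0x2040),
]


def _in_ranges(ch, ranges):
    return any(lo <= ord(ch) <= hi for lo, hi in ranges)


def is_xml_name(s):
    """Return True if s matches the XML 1.0 (Fifth Edition) Name production."""
    if not s or not _in_ranges(s[0], NAME_START_RANGES):
        return False
    return all(_in_ranges(c, NAME_START_RANGES + NAME_EXTRA_RANGES) for c in s[1:])
```

Under these ranges a name such as "ø" passes, while a name starting with a digit or with "@" does not.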

Keep surface ("superficial") forms at least until transfer

Right now, every step of the pipeline up to and including apertium-tagger supports keeping the surface form along with the lemma:

$ echo C-vitaminets effekt | lt-proc -w nb-nn.automorf.bin
^C-vitaminets/C-vitamin<n><nt><sg><def><gen>$ ^effekt/effekt<n><m><sg><ind>$
$ echo C-vitaminets effekt | lt-proc -w nb-nn.automorf.bin | cg-proc nb-nn.rlx.bin 
^C-vitaminets/C-vitamin<n><nt><sg><def><gen>$ ^effekt/effekt<n><m><sg><ind>$
$ echo C-vitaminets effekt | lt-proc -w nb-nn.automorf.bin | cg-proc nb-nn.rlx.bin | apertium-tagger -p -g nb-nn.prob 
^C-vitaminets/C-vitamin<n><nt><sg><def><gen>$ ^effekt/effekt<n><m><sg><ind>$

(The -w switch to lt-proc makes sure the lemma has the same typographical case as given by the dictionary.)
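To make the surface/lemma split concrete, here is a minimal Python sketch that reads the output shown above into (surface, lemma, tags) triples. It assumes every lexical unit has exactly one analysis, as in the tagger output; the regular expression and the name parse_units are assumptions for this example, not part of any Apertium tool.

```python
import re

# ^surface/lemma<tag1><tag2>...$ with a single analysis per unit.
LU_RE = re.compile(r"\^([^/$]*)/([^<$]*)((?:<[^>]+>)*)\$")


def parse_units(stream):
    """Return a list of (surface, lemma, [tags]) for each lexical unit in stream."""
    units = []
    for surface, lemma, tags in LU_RE.findall(stream):
        units.append((surface, lemma, re.findall(r"<([^>]+)>", tags)))
    return units
```

For example, parse_units applied to the tagger output above gives one triple for "C-vitaminets" and one for "effekt".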

It would be useful to have surface form and lemma separate in apertium-transfer too; mostly because we would then be able to avoid all those horrible hacks with trying to maintain typographical case.

Consider:

C-vitaminets effekt => Effekten til C-vitaminet
Vitaminets effekt => Effekten til vitaminet

The reason for keeping the case on "C-vitaminet" but not on "Vitaminet" should be that the lemma of the former is capitalised. However, before transfer, the case of the surface form is applied to the lemma, so we cannot tell whether the capital was in the lemma to begin with. This is the input to the transfer module:

^C-vitamin<n><nt><sg><def><gen>$ ^effekt<n><m><sg><ind>$
^Vitamin<n><nt><sg><def><gen>$ ^effekt<n><m><sg><ind>$

So how can you avoid *"Effekten til Vitaminet" or *"Effekten til c-vitaminet"? (At the moment, this is dealt with in nn-nb by using only lowercase lemmata for stuff like "C-vitamin", and RL entries which apply correct capitalisation -- not very pretty, and pardefs don't really help here.)

See how it is done in is-en with gentilics, e.g. "English-speaking", etc. - Francis Tyers 19:56, 11 March 2010 (UTC)

Switched to that method as it's slightly better, but still... <e lm="BCG-vaksine"><par n="Bb"/><par n="Cc"/><par n="Gg"/>-vaksin<par n="r/e__n"/></e> --unhammer 08:26, 12 March 2010 (UTC)

Solution:

If transfer could read

^C-vitaminets/C-vitamin<n><nt><sg><def><gen>$ ^effekt/effekt<n><m><sg><ind>$
^Vitaminets/vitamin<n><nt><sg><def><gen>$ ^effekt/effekt<n><m><sg><ind>$

then we could keep the capitalisation on "C-vitamin" because we see that the lemma itself is capitalised, while we change "Vitamin" to "vitamin" since its lemma is ordinarily lowercase.
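The decision described here can be sketched in a few lines of Python. This is only an illustration of the logic that having both forms would enable; the function names capital_source and fix_case are hypothetical, and real transfer rules would express this in the transfer XML rather than in Python.

```python
def capital_source(surface, lemma):
    """Say where an initial capital comes from: 'lemma', 'surface' or 'none'."""
    if not surface[:1].isupper():
        return "none"
    return "lemma" if lemma[:1].isupper() else "surface"


def fix_case(target_word, surface, lemma):
    """Lowercase the first letter of target_word unless the capital belongs to the lemma."""
    if capital_source(surface, lemma) == "surface":
        return target_word[:1].lower() + target_word[1:]
    return target_word
```

So fix_case("Vitaminet", "Vitaminets", "vitamin") lowercases the word, while fix_case("C-vitaminet", "C-vitaminets", "C-vitamin") leaves it alone.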

Other considerations:

The transfer.dtd would of course need a new attribute like part="sform".

By interchunk I guess we can throw away the surface form.

640K should be enough for anyone.

apertium-pretransfer changes ^ombud<n><nt><sg><ind><ep-s>+kvinne<n><f><sg><ind>$ into ^ombud<n><nt><sg><ind><ep-s>$ ^kvinne<n><f><sg><ind>$.

So, should

^Ombudskvinne/ombud<n><nt><sg><ind><ep-s>+kvinne<n><f><sg><ind>$ become
^Ombudskvinne/ombud<n><nt><sg><ind><ep-s>$ ^Ombudskvinne/kvinne<n><f><sg><ind>$ or
^Ombudskvinne/ombud<n><nt><sg><ind><ep-s>$ ^ombudskvinne/kvinne<n><f><sg><ind>$?

Should

^OMBUDSKVINNE/ombud<n><nt><sg><ind><ep-s>+kvinne<n><f><sg><ind>$ become
^OMBUDSKVINNE/ombud<n><nt><sg><ind><ep-s>$ ^ombudskvinne/kvinne<n><f><sg><ind>$ or
^OMBUDSKVINNE/ombud<n><nt><sg><ind><ep-s>$ ^OMBUDSKVINNE/kvinne<n><f><sg><ind>$?

If you, in transfer, know that they used to be part of the same lexical unit in the source language, this probably doesn't matter too much.
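As an illustration of the first alternative in each pair (every part keeps the unchanged surface form), here is a minimal Python sketch. It is not the behaviour of apertium-pretransfer; the name split_compound is made up, and the sketch assumes that "+" never occurs inside a lemma or a tag.

```python
def split_compound(unit):
    """Split '^surface/a<..>+b<..>$' into one unit per part, each keeping the surface form."""
    if not (unit.startswith("^") and unit.endswith("$")):
        raise ValueError("not a lexical unit: " + unit)
    surface, analysis = unit[1:-1].split("/", 1)
    # Assumption: '+' only ever joins the parts of a compound analysis.
    return " ".join("^%s/%s$" % (surface, part) for part in analysis.split("+"))
```

The second alternative would differ only in how the surface form of the non-initial parts is cased.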

Allow the chunk tag wherever we allow other "strings"

<chunk name="foo"><tags><tag><lit-tag v="bar"/></tag></tags><lu><lit v="fie"/></lu></chunk> just outputs ^foo<bar>{^fie$}$ -- a simple string. We can build strings from tags, literals and variables and store them in variables, but we cannot do the same with the chunk tag, leading to this kind of mess:

        <let>
          <concat>
            <lit v="^pron"/>
            <lit-tag v="@SUBJ→"/>
            <clip pos="1" part="pers"/>
            <lit-tag v="GD"/>
            <clip pos="1" part="nbr"/>
            <lit-tag v="nom"/>
            <lit v="{^"/>
            <lit v="prpers"/>
            <lit-tag v="prn"/>
            <clip pos="1" part="pers"/>
            <lit-tag v="mf"/>
            <clip pos="1" part="nbr"/>
            <lit-tag v="nom"/>
            <lit v="$}$"/>
          </concat>
        </let>

Wish: allow <let><chunk>...</chunk></let> and <concat><chunk>...</chunk></concat> (chunk "returns" a string, variables hold strings).
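The point that a chunk is "just a string" can be shown with a small Python sketch that builds the same string as the concat above from structured parts. The helper names lu and chunk are made up for this example, and the values p3 and sg stand in for whatever the pers and nbr clips would return.

```python
def lu(lemma, tags):
    """Build one lexical unit: ^lemma<tag1><tag2>...$"""
    return "^" + lemma + "".join("<%s>" % t for t in tags) + "$"


def chunk(name, tags, units):
    """Build one chunk: ^name<tag1>...{unit1 unit2 ...}$"""
    return "^" + name + "".join("<%s>" % t for t in tags) + "{" + " ".join(units) + "}$"


# Hypothetical clip values for position 1.
pers, nbr = "p3", "sg"
subject = chunk("pron", ["@SUBJ→", pers, "GD", nbr, "nom"],
                [lu("prpers", ["prn", pers, "mf", nbr, "nom"])])
```

The result is an ordinary string that could be stored in a variable, which is what allowing chunk inside let and concat would amount to.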

Allow "postchunking" of chunks in interchunk

When you want to merge chunks in interchunk it would be nice to be able to collapse the tags of non-head chunks.

For example, if we want to merge SN PREP SN for "The 10 most popular films in American cinemas", we get:

t1x:
^Det_num_adj_nom<SN><@X><pl>{^the<det><def><3>$ ^10<num>$ ^most<preadv>$ ^popular<adj>$ ^film<n><3>$}$ 
^í<PREP>{^in<pr>$}$ 
^adj_nom<SN><@X><pl>{^American<adj>$ ^cinema<n><3>$}$

t2x:
^sn_prep_sn<SN><@X><pl>{^the<det><def><3>$ ^10<num>$ ^most<preadv>$ ^popular<adj>$ ^film<n><3>$ ^in<pr>$ ^American<adj>$ ^cinema<n><3>$}$

The <3> is replaced with <pl> in postchunk. In this case 'pl' is the same in both SN chunks; if it were not, it would be nice to be able to do something like

    <rule comment="REGLA: SN PREP SN">
      <pattern>
        <pattern-item n="SN"/>
        <pattern-item n="PREP"/>
        <pattern-item n="SN"/>
      </pattern>
      <action>
        <out>
          <chunk>
            <lit v="sn_prep_sn"/>
            <clip pos="1" part="tags"/>
            <lit v="{"/>
              <clip pos="1" part="content"/>
              <b pos="1"/>
              <clip pos="2" part="content"/>
              <b pos="2"/>
              <merge-tags>
                <clip pos="3" part="tags"/>
                <clip pos="3" part="content"/>
              </merge-tags>
            <lit v="}"/>
          </chunk>
        </out>
      </action>
    </rule>

so that we get

 ^sn_prep_sn<SN><@X><pl>{^the<det><def><3>$ ^10<num>$ ^most<preadv>$ ^popular<adj>$ ^film<n><3>$ ^in<pr>$ ^American<adj>$ ^cinema<n><pl>$}$
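What the wished-for merge-tags element would do to the third chunk can be sketched in Python: numbered references such as <3> in the chunk content are replaced by the chunk's own tag at that position before the content is merged into the head chunk. The function name merge_tags mirrors the proposed element and is not an existing Apertium feature.

```python
import re


def merge_tags(chunk_tags, content):
    """Replace <N> references in content with the N-th tag of the chunk (1-based)."""
    def resolve(match):
        index = int(match.group(1))
        if 1 <= index <= len(chunk_tags):
            return "<" + chunk_tags[index - 1] + ">"
        return match.group(0)  # leave out-of-range references untouched
    return re.sub(r"<(\d+)>", resolve, content)
```

For the chunk ^adj_nom<SN><@X><pl>{...}$ this turns ^cinema<n><3>$ into ^cinema<n><pl>$, so the later postchunk step cannot replace it with the number of the head chunk.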

A "grouping" tag for bidix

Most of the time when restricting entries to LR or RL in bidix, we have one pair of entries that works in both directions, possibly with lots of LR's that all go to the same <r>, or lots of RL's that all go to the same <l>. Making certain these actually _do_ go to the same entry where they should means looking through lots of entries manually, since in some cases we _don't_ want it to be like that (i.e. we can't just write a program to check this, since there are general rules and there are exceptions).

What I'd like is just some way of keeping LR's and RL's in bidix together. One possibility would be to represent it this way:

 <eg>
   <em>       <p><l>foo</l><r>bar</r></p></em>
   <LR>        <p><l>fie</l>                    </p></LR>
   <RL>        <p>                  <r>bum</r></p></RL>
 </eg>
 <e r="LR"><p><l>foe</l><r>baz</r></p></e>

This would be equivalent to:

 <e>           <p><l>foo</l><r>bar</r></p></e>
 <e r="LR"><p><l>fie</l><r>bar</r></p></e>
 <e r="RL"><p><l>foo</l><r>bum</r></p></e>
 <e r="LR"><p><l>foe</l><r>baz</r></p></e>

The idea is that within the <eg> entries, we know that all LR's have the same <r>, and all RL's have the same <l>, and so an LR can't have an <r> specified.
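The equivalence above could be implemented as a preprocessing step. The following Python sketch expands one <eg> group into (restriction, left, right) triples using the standard xml.etree.ElementTree module; the element names eg, em, LR and RL come from the proposal above, while the function name expand_group and the triple representation are assumptions for this example.

```python
import xml.etree.ElementTree as ET


def expand_group(eg_xml):
    """Expand an <eg> group into a list of (restriction, left, right) entries."""
    eg = ET.fromstring(eg_xml)
    main = eg.find("em/p")
    main_l, main_r = main.findtext("l"), main.findtext("r")
    entries = [(None, main_l, main_r)]  # the bidirectional entry
    for child in eg:
        if child.tag == "LR":
            if child.find("p/r") is not None:
                raise ValueError("an LR inside <eg> must not specify <r>")
            entries.append(("LR", child.findtext("p/l"), main_r))
        elif child.tag == "RL":
            if child.find("p/l") is not None:
                raise ValueError("an RL inside <eg> must not specify <l>")
            entries.append(("RL", main_l, child.findtext("p/r")))
    return entries
```

Applied to the <eg> example above, this yields the first three entries of the equivalent plain listing.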

Better apertium-gen-modes

apertium-gen-modes is used for two purposes:

making local modes files, used like apertium -d . nn-nb
making installable modes files, used like apertium nn-nb

Unfortunately, each time you run sudo make install, the local ones are overwritten by files which have root ownership. Very annoying.

To avoid this, the Makefile.am in apertium-nn-nb currently has

modes/$(PREFIX1).mode: modes.xml
	apertium-gen-modes modes.xml
	cp *.mode modes/

modes/$(PREFIX2).mode: modes.xml 
	apertium-gen-modes modes.xml
	cp *.mode modes/

apertium_nn_nb_DATA= […]
	            modes/$(PREFIX1).mode modes/$(PREFIX2).mode modes.xml

install-data-local:
	mv modes modes.bak
	apertium-gen-modes modes.xml apertium-$(PREFIX1)
	rm -rf modes
	mv modes.bak modes
	test -d $(apertium_nn_modesdir) || mkdir $(apertium_nn_modesdir)
	$(INSTALL_DATA) $(PREFIX1).mode $(apertium_nn_modesdir)
	$(INSTALL_DATA) $(PREFIX2).mode $(apertium_nn_modesdir)
	rm $(PREFIX1).mode $(PREFIX2).mode

There must be a better way. One could shorten it down to

modes/$(PREFIX1).mode: modes.xml
	apertium-gen-modes modes.xml

modes/$(PREFIX2).mode: modes.xml 
	apertium-gen-modes modes.xml

noinst_DATA=modes/$(PREFIX1).mode modes/$(PREFIX2).mode modes.xml

install-data-local:
	apertium-gen-modes modes.xml apertium-$(PREFIX1)
	test -d $(apertium_nn_modesdir) || mkdir $(apertium_nn_modesdir)
	$(INSTALL_DATA) $(PREFIX1).mode $(apertium_nn_modesdir)
	$(INSTALL_DATA) $(PREFIX2).mode $(apertium_nn_modesdir)
	rm $(PREFIX1).mode $(PREFIX2).mode

by applying

Index: apertium/apertium-createmodes.awk
===================================================================
--- apertium/apertium-createmodes.awk	(revision 20175)
+++ apertium/apertium-createmodes.awk	(working copy)
@@ -8,13 +8,12 @@
   }
   else if(HEAD != 0)
   {
-    myfilename = NAME ".mode";
-    if(ARR[3] == "yes")
+    if(ARR[3] == "yes" || install == "no")
     {
-      myfilename = "../" myfilename;
+      myfilename = NAME ".mode";
+      # fool code because a bug in mawk
+      printf $0 "\n"  >> myfilename;
+      close(myfilename);
     }
-    # fool code because a bug in mawk
-    printf $0 "\n"  >> myfilename;
-    close(myfilename);
   }
 }
Index: apertium/Makefile.am
===================================================================
--- apertium/Makefile.am	(revision 20175)
+++ apertium/Makefile.am	(working copy)
@@ -329,7 +329,7 @@
 	@cat modes-header.sh >> $@
 	@echo "$(XMLLINT) --dtdvalid $(apertiumdir)/modes.dtd --noout \$$FILE1 && \\" >> $@
 	@if [ `basename $(XSLTPROC)` == xsltproc ]; \
-	  then echo "$(XSLTPROC) --stringparam prefix $(prefix)/bin --stringparam dataprefix \$$FULLDIRNAME  $(apertiumdir)/modes2bash.xsl \$$FILE1 | awk -f $(apertiumdir)/apertium-createmodes.awk PARAM=\$$FULLDIRNAME"; \
+	  then echo "$(XSLTPROC) --stringparam prefix $(prefix)/bin --stringparam dataprefix \$$FULLDIRNAME  $(apertiumdir)/modes2bash.xsl \$$FILE1 | awk -f $(apertiumdir)/apertium-createmodes.awk PARAM=\$$FULLDIRNAME install=\$$INSTALL"; \
           else echo "$(XSLTPROC) $(apertiumdir)/modes2bash.xsl \$$FILE1 \\\$$prefix=$(prefix)/bin \\\$$dataprefix=\$$FULLDIRNAME| awk -f $(apertiumdir)/apertium-createmodes.awk PARAM=\$$FULLDIRNAME"; \
           fi >> $@ 
 	@chmod a+x $@
Index: apertium/modes-header.sh
===================================================================
--- apertium/modes-header.sh	(revision 20175)
+++ apertium/modes-header.sh	(working copy)
@@ -17,15 +17,17 @@
 
 rm -Rf *.mode
 
-if [ ! -d $FULLDIRNAME/modes ]
-then mkdir $FULLDIRNAME/modes
-else rm -Rf $FULLDIRNAME/modes && mkdir $FULLDIRNAME/modes
-fi
-
 FILE1=$FULLDIRNAME/$(basename $1)
-cd $FULLDIRNAME/modes
 
-if [ $# -eq 2 ]; then
+if [ $# -eq 1 ]; then
+	INSTALL="no"
+	if [ -d $FULLDIRNAME/modes ]; then
+		rm -Rf $FULLDIRNAME/modes
+	fi
+	mkdir $FULLDIRNAME/modes
+	cd $FULLDIRNAME/modes
+elif [ $# -eq 2 ]; then
+	INSTALL="yes"
 	PREFIX=$2;
 	FULLDIRNAME=$APERTIUMDIR"/"$PREFIX;
 fi

but then a lot of Makefiles would have to be changed...
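The condition introduced in the awk part of the patch can be restated as a small Python sketch. It assumes that ARR[3] holds the mode's install flag from modes.xml and that the install variable is "no" for a local run and "yes" for an installing run, as set in the patched modes-header.sh; the function names and the mode name nn-nb-debug are hypothetical.

```python
def should_write_mode(mode_install_flag, install_var):
    """Mirror of the patched awk test: ARR[3] == "yes" || install == "no"."""
    return mode_install_flag == "yes" or install_var == "no"


def modes_written(modes, install_var):
    """Return the names of the modes that would get a .mode file."""
    return [name for name, flag in modes if should_write_mode(flag, install_var)]


# Hypothetical modes.xml contents: (mode name, install flag).
MODES = [("nn-nb", "yes"), ("nn-nb-debug", "no")]
```

A local run writes every mode, while an installing run writes only the modes marked for installation.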

