Difference between revisions of "Talk:Constraint-based lexical selection module"




Revision as of 19:52, 13 September 2012


Here is a suggestion of how lists should work.

The list would be interpreted as a sequence of "match" tags, i.e. when we compile, we just expand "list-items" to "match"s. This means that if we want it to be a list in the normal sense, we need to put it in an "or". So, it could also be used as a kind of templatey thing. But I'm not sure if this is overloading.


    <list n="CUMPLIR-N">
      <list-item lemma="urte"/>
      <list-item lemma="agindu"/>
      <list-item lemma="amets"/>
      <list-item lemma="arau"/>
      <list-item lemma="asmo"/>
      <list-item lemma="baldintza"/>
      <list-item lemma="esan" tags="n"/>
      <list-item lemma="eskubide"/>
      <list-item lemma="helburu"/>
      <list-item lemma="hitz"/>
      <list-item lemma="irizpide"/>
      <list-item lemma="lan"/>
      <list-item lemma="lege"/>
      <list-item lemma="obligazio"/>
      <list-item lemma="proposamen"/>
      <list-item lemma="erregelamendu"/>
    </list>

    <match list="CUMPLIR-N"/>
    <match lemma="ez"/>
    <match lemma="izan"/>
    <match lemma="bete"><select lemma="cumplir"/></match>
  <rule><!--SELECT ("cumplir" n) IF (-3 ("urte") OR CUMPLIR-N LINK 0 abs)(-2 ("ez"))(-1 ("izan")); -->

      <list v="CUMPLIR-N"/>
    <match lemma="ez"/>
    <match lemma="izan"/>
    <match lemma="la"/>
    <match lemma="bete"><select lemma="cumplir"/></match>
  <rule><!--SELECT ("cumplir" n) IF (-3 ("urte") OR CUMPLIR-N LINK 0 abs)(-2 ("ez"))(-1 ("izan")); -->
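
The expansion step described above could be sketched roughly as follows. This is only an illustration, not the compiler's actual behaviour: the `<rules>` wrapper, the closed `<list>` element, the explicit `<or>` and the function name `expand_lists` are assumptions made for the example, and the list is shortened to two items.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed version of the sketch above (shortened list).
SKETCH = """
<rules>
  <list n="CUMPLIR-N">
    <list-item lemma="urte"/>
    <list-item lemma="esan" tags="n"/>
  </list>
  <rule>
    <or><match list="CUMPLIR-N"/></or>
    <match lemma="ez"/>
    <match lemma="izan"/>
    <match lemma="bete"><select lemma="cumplir"/></match>
  </rule>
</rules>
"""

def expand_lists(root):
    """Replace every <match list="X"/> by one <match> per list-item of X."""
    lists = {l.get("n"): l.findall("list-item") for l in root.findall("list")}
    for parent in list(root.iter()):          # snapshot, since we edit the tree
        for child in list(parent):
            name = child.get("list")
            if child.tag != "match" or name is None:
                continue
            index = list(parent).index(child)
            parent.remove(child)
            # The items are inserted as a plain sequence; they only behave as
            # alternatives because the rule author wrapped them in <or>.
            for offset, item in enumerate(lists[name]):
                parent.insert(index + offset, ET.Element("match", dict(item.attrib)))
    return root

root = expand_lists(ET.fromstring(SKETCH))
print(ET.tostring(root.find("rule"), encoding="unicode"))
```

Without the `<or>`, the same expansion would yield a sequence of required matches, which is the "templatey" use mentioned above.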


Rule formats

Note: Felipe doesn't like "skip".

can't say I do either … sounds like a command rather than a constraint --unhammer 11:42, 18 November 2011 (UTC)
also, in the <or>, should we read them as independent of each other? that's a bit confusing since otherwise they're all required and have a certain order --unhammer 11:42, 18 November 2011 (UTC)

The regular expression for the OR below is:

(nazi<adj>[0-9A-Za-z <>]*|totalitari<adj>[0-9A-Za-z <>]*|feixista<adj>[0-9A-Za-z <>]*|franquista<adj>[0-9A-Za-z <>]*|militar<adj>[0-9A-Za-z <>]*|fiscal<adj>[0-9A-Za-z <>]*)

- Francis Tyers 11:53, 18 November 2011 (UTC)
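
As a rough illustration of how such an expression could be built from the items in the OR, here is a minimal sketch. The function name `or_to_regex` is made up for the example; the lemmas, the tag and the trailing character class are copied from the expression above.

```python
import re

# Lemmas from the OR on this page; they all share the first tag "adj".
LEMMAS = ["nazi", "totalitari", "feixista", "franquista", "militar", "fiscal"]
TAIL = "[0-9A-Za-z <>]*"  # any further tags, as in the expression above

def or_to_regex(lemmas, tag):
    """Join one alternative per lemma into a single alternation."""
    alternatives = [re.escape(lemma) + "<" + tag + ">" + TAIL for lemma in lemmas]
    return "(" + "|".join(alternatives) + ")"

pattern = or_to_regex(LEMMAS, "adj")
matcher = re.compile(pattern)

# An adjective reading matches, a noun reading of the same lemma does not.
print(bool(matcher.fullmatch("militar<adj><mf><sg>")))
print(bool(matcher.fullmatch("militar<n><m><sg>")))
```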

    <remove lemma="règim" tags="n.*">
      <acception lemma="diet" tags="n.*"/>
      <skip lemma="nazi" tags="adj.*"/>
      <skip lemma="totalitari" tags="adj.*"/>
      <skip lemma="feixista" tags="adj.*"/>
      <skip lemma="franquista" tags="adj.*"/>
      <skip lemma="militar" tags="adj.*"/>
      <skip lemma="fiscal" tags="adj.*"/>
    <remove lemma="règim" tags="n.*">
      <acception lemma="diet" tags="n.*"/>
      <pattern lemma="nazi" tags="adj.*"/>
      <pattern lemma="totalitari" tags="adj.*"/>
      <pattern lemma="feixista" tags="adj.*"/>
      <pattern lemma="franquista" tags="adj.*"/>
      <pattern lemma="militar" tags="adj.*"/>
      <pattern lemma="fiscal" tags="adj.*"/>
    <remove-from lemma="règim" tags="n.*">
      <translation lemma="diet" tags="n.*"/>
        <pattern-item lemma="nazi" tags="adj.*"/>
        <pattern-item lemma="totalitari" tags="adj.*"/>
        <pattern-item lemma="feixista" tags="adj.*"/>
        <pattern-item lemma="franquista" tags="adj.*"/>
        <pattern-item lemma="militar" tags="adj.*"/>
        <pattern-item lemma="fiscal" tags="adj.*"/>

  <rule c="la dona dels seus somnis">
    <select-for lemma="dona" tags="n.*">
      <translation lemma="wife" tags="n.*"/>
      <pattern-item lemma="de" tags="pr.*"/>
      <pattern-item lemma="*" tags="det.pos.*"/>
      <pattern-item lemma="somni" tags="n.*"/>

    <target lemma="règim" tags="n.*">
      <remove lemma="diet" tags="n.*"/>

  <rule c="la dona dels seus somnis">
    <target lemma="dona" tags="n.*">
      <select lemma="wife" tags="n.*"/>


s	("estació" n)	("season" n)	(1 "plujós")
s	("estació" n)	("season" n)	(2 "plujós")
s	("estació" n)	("season" n)	(1 "de") (3 "any")
s	("estació" n)	("station" n)	(1 "de") (3 "Línia")
s	("prova" n)	("evidence" n)	(1 "arqueològic")
s	("prova" n)	("test" n)	(1 "estadístic")
s	("prova" n)	("event" n)	(-3 "guanyador") (-2 "de") 
s	("prova" n)	("testing" n)	(-2 "tècnica") (-1 "de") 
s	("joc" n)	("game" n)	(1 "olímpic")
s	("joc" n)	("set" n)	(1 "de") (2 "caràcter")
r	("pista" n)	("hint" n)	(1 "més") (2 "llarg")
r	("pista" n)	("clue" n)	(1 "més") (2 "llarg")
r	("motiu" n)	("motif" n)	(-1 "aquest") (-2 "per")
s	("carn" n)	("flesh" n)	(1 "i") (2 "os")
s	("sobre" pr)	("over" n)	(-1 "victòria")
s	("dona" n)	("wife" n)	(-1 "*" det pos)
s	("dona" n)	("wife" n)	(-1 "el") (1 "de")
s	("dona" n)	("woman" n)	(1 "de") (2 "*" det pos) (3 "somni")
r	("patró" n)	("pattern" n)	(1 "*" np ant)
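
For reference, one way the rule lines above might be read into a structure is sketched below. The names `LexRule`, `ContextItem` and `parse_rule` are hypothetical, and the sketch assumes that the source and target groups carry exactly one tag and that every context group starts with an offset.

```python
import re
from typing import NamedTuple

class ContextItem(NamedTuple):
    position: int   # offset relative to the centre word
    lemma: str      # "*" stands for any lemma
    tags: tuple     # zero or more tags, e.g. ("det", "pos")

class LexRule(NamedTuple):
    kind: str       # "select" for s, "remove" for r
    source: tuple   # (lemma, tag)
    target: tuple   # (lemma, tag)
    context: list   # list of ContextItem

GROUP = re.compile(r"\(([^()]*)\)")                # one parenthesised group
ITEM = re.compile(r'(-?\d+)?\s*"([^"]*)"\s*(.*)')  # optional offset, "lemma", tags

def parse_rule(line):
    kind = {"s": "select", "r": "remove"}[line.split(None, 1)[0]]
    groups = []
    for body in GROUP.findall(line):
        offset, lemma, tags = ITEM.fullmatch(body.strip()).groups()
        groups.append((offset, lemma, tuple(tags.split())))
    (_, sl, sl_tags), (_, tl, tl_tags) = groups[0], groups[1]
    context = [ContextItem(int(o), l, t) for o, l, t in groups[2:]]
    return LexRule(kind, (sl, sl_tags[0]), (tl, tl_tags[0]), context)

rule = parse_rule('s\t("prova" n)\t("event" n)\t(-3 "guanyador") (-2 "de")')
print(rule.kind, rule.source, rule.target, rule.context)
```

Whitespace between the columns may be tabs or spaces, since only the parenthesised groups are inspected.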

Jacob's critique of the rule format

By 'skip' you actually mean 'match'. And lemma selection should be called <select> :-)

So I'd write

  <match lemma="el"/>  
  <match lemma="dona" tags="n.*">  <select lemma="wife"/>  </match>
  <match lemma="de"/>

and actually you would like to work on categories as well as lemmas.

Like, to prefer (human) beings and not things before "feel".

  <match tags="n.*">  <select cat="beings"/>  </match>
  <match lemma="feel"/>

--Jacob Nordfalk 12:22, 30 November 2011 (UTC)

I like this a lot actually!
    <match lemma="estació"> <select lemma="season"/></match>
    <match lemma="plujós" tags="adj.*"/>

It seems much more consistent... Would you allow every match to have a rule, or each rule to have more than one operation? E.g.

    <match lemma="juego"> <select lemma="set"/></match>
    <match lemma="de"/>
    <match lemma="etiqueta"> <select lemma="tag"/></match>

- Francis Tyers 11:31, 1 December 2011 (UTC)

Sergio's idea

  <word source-lemma="pista" source-tags="n.*">
    <remove-acception target-lemma="hint" target-tags="n.*"/>
    <remove-acception target-lemma="clue" target-tags="n.*"/>
  </word>
  <word source-lemma="més" source-tags="preadv.*"/>
  <word source-lemma="llarg" source-tags="adj.*"/>

(10:44:10) Sergio Ortiz Rojas: my aim is to add to Jacob's idea the notion of expressing bilinguality in the notation
(10:48:06) Sergio Ortiz Rojas: I think what Jacob and Felipe prefer is better than the original
(10:48:19) Sergio Ortiz Rojas: my only criticism is that if you leave it like that it will be too "programmatic"
(10:48:19) Sergio Ortiz Rojas: that is
(10:48:24) Sergio Ortiz Rojas: too much for people who write programs
(10:48:26) Sergio Ortiz Rojas: and not declarative enough

Old application strategy

The following is an inefficient implementation of the rule application process:

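import re

# Example rule and the fields it compiles to:
# s	("prova" n)	("event" n)	(-3 "guanyador") (-2 "de")
# tipus = "select"
# centre = "^prova<n>.*"
# tl_patro = ["^event<n>.*"]
# sl_patro = {-3: "^guanyador<", -2: "^de<"}

class Rule:
    def __init__(self, tipus, centre, tl_patro, sl_patro):
        self.tipus = tipus        # 'select' or 'remove'
        self.centre = centre      # regex for the source-language centre word
        self.tl_patro = tl_patro  # regexes for the target-language translations
        self.sl_patro = sl_patro  # {relative position: regex for a context word}

rule_table = {}  # e.g. rule_table["estació"] = [rule1, rule2, rule3]

def apply_rule(rule, tl):
    # Return the translations in tl that survive the rule.
    def in_patro(target):
        return any(re.match(patro, target) for patro in rule.tl_patro)
    if rule.tipus == 'select':
        return [t for t in tl if in_patro(t)]    # delete everything not selected
    return [t for t in tl if not in_patro(t)]    # 'remove': delete what is listed

def context_matches(rule, sentence, i):
    # True only if all of the context items match at their positions.
    for offset, patro in rule.sl_patro.items():
        j = i + offset
        if not 0 <= j < len(sentence) or not re.match(patro, sentence[j][0]):
            return False
    return True

def apply_rules(sentence):
    # sentence is a list of (sl, tl) pairs, where tl is a list of translations.
    for i, (sl, _) in enumerate(sentence):
        for centre in rule_table:    # scanning every centre is the inefficient part
            if centre not in sl:
                continue
            for rule in rule_table[centre]:
                if re.match(rule.centre, sl) and context_matches(rule, sentence, i):
                    sentence[i] = (sl, apply_rule(rule, sentence[i][1]))
                    break            # a rule matched: stop trying rules for this centre
    return sentence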

How Brian writes rules

To come up with lexical rules, I use a wide variety of methods. Generally, I begin by thinking of common English words which can be used in several different contexts. Often, I look around the room for inspiration or write down ambiguous English words that I come across in my reading. Using this list of English words, I consult my English-Spanish dictionary (Collins Spanish Concise Dictionary, 6th Edition) and then check online to see whether the word does indeed change in Spanish depending on the context. While checking on the internet, I visit several websites to ensure accuracy. One that I've found to be particularly reliable is www.spanishdict.com, which has an extensive English-Spanish translation dictionary and covers most contexts of any given English word, along with many English idioms.

After I locate a word both online and in my English-Spanish dictionary, I compare the Spanish word each gives for the English word in a given context. If the resulting word is the same in both, I proceed to writing the rule. If not, I continue checking to try to reach a consensus. If after repeated attempts I can't find the best word, I move on and try a different word. Once I do find a consensus word, I move on to writing the rule.

Once I've found the word I want to use in a certain context, I begin writing the rule, paying particular attention to make sure it's a lexical selection rule rather than a multiword unit. Writing the rule is likely the easiest part of the process, as there is a template to follow. After I finish writing all of the rules, I first validate the file I'm writing the rules in, and if it validates I copy the rules into the apertium-en-es.en-es.lrx file. Before committing this file to SVN, I check that it validates too.

Usage (old)

$ cat /tmp/test | python apertium-lex-rules.py rules.txt 2>/dev/null
^estació<n><f><sg>/season<n><sg>$ ^més<preadv>/more<preadv>$ ^plujós<adj><f><sg>/rainy<adj><sint><f><sg>$ 
^ser<vbser><pri><p3><sg>/be<vbser><pri><p3><sg>$ ^el<det><def><f><sg>/the<det><def><f><sg>$ 
^tardor<n><f><sg>/autumn<n><sg>/fall<n><sg>$^,<cm>/,<cm>$ ^i<cnjcoo>/and<cnjcoo>$ ^el<det><def><f><sg>/the<det><def><f><sg>$ 
^més<preadv>/more<preadv>$ ^sec<adj><f><sg>/dry<adj><sint><f><sg>$ ^el<det><def><m><sg>/the<det><def><m><sg>$ 
^estiu<n><m><sg>/summer<n><sg>$ ^.<sent>/.<sent>$ 
With rules
$ cat /tmp/test | python apertium-lex-rules.py rules.txt | apertium-vm -c ca-en.t1x.vmb | apertium-vm -c ca-en.t2x.vmb |\
   apertium-vm -c ca-en.t3x.vmb | lt-proc -g ca-en.autogen.bin

rainiest season 
is the 
autumn, and the 
driest the 
With bilingual dictionary defaults
$ cat /tmp/test | apertium-lex-defaults ca-en.autoldx.bin | apertium-vm -c ca-en.t1x.vmb | apertium-vm -c ca-en.t2x.vmb |\
   apertium-vm -c ca-en.t3x.vmb | lt-proc -g ca-en.autogen.bin

rainiest station 
is the 
autumn, and the 
driest the 

Work plan (12th March 2012)

  1. Learn rules from a corpus where the best translation is selected by the TLM
  2. Try changing the parameters of the LM
    1. Remove stopwords
    2. Rank on lemmata
  3. Try neighbouring word permutations, and ranking on LM
    1. Remove stopwords
    2. Rank on lemmata
  4. Run tests on eu-es
TLM (s5) 248/603 41.1%
TLM (l5) 211/603 34.9%
TLM (n5) 271/603 44.9%
TLM (l5+flip2) 220/603 36.4%
TLM (ptes-s5) -/603 48.4%
TLM (fres-s5) 280/603 46.4%

Programs to make:

  1. A program that takes biltrans output as its input, generates all possible combinations, translates them, scores them with the language model, takes the biltrans output with the highest LM score, and compares it against a reference biltrans output.
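
A very rough sketch of what such a program might look like is given below. Everything in it is an assumption made for illustration: `toy_score` merely counts a few hand-picked bigrams in place of a real language model, the translation step is skipped (the candidate target lemmas are scored directly), and the function names are invented.

```python
from itertools import product

# Toy stand-in for a language model: bigrams we pretend are well attested.
GOOD_BIGRAMS = {("rainy", "season"), ("dry", "summer")}

def toy_score(words):
    """Hypothetical scorer: count the known-good bigrams in the sequence."""
    return sum(1 for pair in zip(words, words[1:]) if pair in GOOD_BIGRAMS)

def best_combination(biltrans, score):
    """biltrans holds one list of candidate translations per source word."""
    return max(product(*biltrans), key=score)   # tries every combination

def agreement(chosen, reference):
    """Return (matches, total) for two equally long sequences."""
    matches = sum(1 for c, r in zip(chosen, reference) if c == r)
    return matches, len(reference)

biltrans = [["rainy"], ["season", "station"]]
chosen = best_combination(biltrans, toy_score)
print(chosen, agreement(chosen, ("rainy", "season")))
```

Note that the number of combinations grows multiplicatively with the number of ambiguous words, so exhaustive enumeration is only practical for short sentences.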