Talk:Constraint-based lexical selection module

Lists

Here is a suggestion of how lists should work.

The list would be interpreted as a sequence of "match" tags: when we compile, we just expand each "list-item" into a "match". This means that if we want it to behave as a list in the normal sense, we need to put it inside an "or". It could therefore also be used as a kind of template, but I'm not sure whether that is overloading the construct.


<rules>

  <lists>
    <list n="cumplir-n">
      <list-item lemma="urte"/>
      <list-item lemma="agindu"/>
      <list-item lemma="amets"/>
      <list-item lemma="arau"/>
      <list-item lemma="asmo"/>
      <list-item lemma="baldintza"/>
      <list-item lemma="esan" tags="n"/>
      <list-item lemma="eskubide"/>
      <list-item lemma="helburu"/>
      <list-item lemma="hitz"/>
      <list-item lemma="irizpide"/>
      <list-item lemma="lan"/>
      <list-item lemma="lege"/>
      <list-item lemma="obligazio"/>
      <list-item lemma="proposamen"/>
      <list-item lemma="erregelamendu"/>
    </list>
  </lists>

  <rule>
    <or>
      <match list="cumplir-n"/>
    </or>
    <match lemma="ez"/>
    <match lemma="izan"/>
    <match lemma="bete"><select lemma="cumplir"/></match>
  </rule><!--SELECT ("cumplir" n) IF (-3 ("urte") OR CUMPLIR-N LINK 0 abs)(-2 ("ez"))(-1 ("izan")); -->


  <rule>
    <or>
      <match list="cumplir-n"/>
    </or>
    <match lemma="ez"/>
    <match lemma="izan"/>
    <match lemma="la"/>
    <match lemma="bete"><select lemma="cumplir"/></match>
  </rule><!--SELECT ("cumplir" n) IF (-3 ("urte") OR CUMPLIR-N LINK 0 abs)(-2 ("ez"))(-1 ("izan")); -->


</rules>
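
A minimal sketch of the compile-time expansion suggested above, assuming the element and attribute names used in the example (`lists`, `list`, `list-item`, `match` with a `list` attribute). The function name `expand_lists` and the use of Python's `xml.etree.ElementTree` are illustrative choices, not part of any existing compiler. Each `<match list="NAME"/>` is replaced in place by one `<match>` per list item, so wrapping the reference in an `<or>` yields the usual "any of these" reading.

```python
import copy
import xml.etree.ElementTree as ET


def expand_lists(root):
    """Replace every <match list="NAME"/> by one <match> per <list-item>."""
    # Collect the list definitions: list name -> attribute dicts of its items.
    lists = {
        lst.get("n"): [dict(item.attrib) for item in lst.findall("list-item")]
        for lst in root.findall("./lists/list")
    }
    # ElementTree has no parent pointers, so visit every element as a parent.
    for parent in list(root.iter()):
        expanded = []
        for child in list(parent):
            if child.tag == "match" and child.get("list") is not None:
                for attrs in lists[child.get("list")]:  # KeyError if undefined
                    match = ET.Element("match", dict(attrs))
                    # Carry over any children (e.g. <select>) of the reference.
                    match.extend([copy.deepcopy(c) for c in child])
                    expanded.append(match)
            else:
                expanded.append(child)
        parent[:] = expanded
    return root


SAMPLE = """
<rules>
  <lists>
    <list n="cumplir-n">
      <list-item lemma="urte"/>
      <list-item lemma="esan" tags="n"/>
    </list>
  </lists>
  <rule>
    <or>
      <match list="cumplir-n"/>
    </or>
    <match lemma="bete"><select lemma="cumplir"/></match>
  </rule>
</rules>
"""

expanded_root = expand_lists(ET.fromstring(SAMPLE))
or_items = [m.attrib for m in expanded_root.find("./rule/or")]
print(or_items)
```

Used outside an `<or>`, the same expansion would produce a fixed sequence of matches, which is the "template" reading mentioned above.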

Rule formats

Note: Felipe doesn't like "skip".

can't say I do either … sounds like a command rather than a constraint --unhammer 11:42, 18 November 2011 (UTC)

also, in the <or>, should we read them as independent of each other? that's a bit confusing since otherwise they're all required and have a certain order --unhammer 11:42, 18 November 2011 (UTC)

The regular expression for the OR below is:

(nazi<adj>[0-9A-Za-z <>]*|totalitari<adj>[0-9A-Za-z <>]*|feixista<adj>[0-9A-Za-z <>]*|franquista<adj>[0-9A-Za-z <>]*|militar<adj>[0-9A-Za-z <>]*|fiscal<adj>[0-9A-Za-z <>]*)

- Francis Tyers 11:53, 18 November 2011 (UTC)
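
As a rough illustration of how such an expression could be generated, here is a sketch that builds it from (lemma, tags) pairs. The tail `[0-9A-Za-z <>]*` is copied from the expression above. The function names, and the reading of `tags="adj.*"` as "tag `adj` followed by anything", are assumptions made for this example.

```python
import re

TAIL = "[0-9A-Za-z <>]*"  # open-ended tag tail, copied from the expression above


def item_to_regex(lemma, tags):
    """Turn lemma="nazi", tags="adj.*" into 'nazi<adj>[0-9A-Za-z <>]*'."""
    parts = tags.split(".")
    open_ended = parts[-1] == "*"
    if open_ended:
        parts = parts[:-1]
    fixed = "".join("<%s>" % re.escape(tag) for tag in parts)
    return re.escape(lemma) + fixed + (TAIL if open_ended else "")


def or_to_regex(items):
    """Join the alternatives of an <or> into one parenthesised group."""
    return "(" + "|".join(item_to_regex(lemma, tags) for lemma, tags in items) + ")"


pattern = or_to_regex([("nazi", "adj.*"), ("fiscal", "adj.*")])
print(pattern)
print(bool(re.fullmatch(pattern, "nazi<adj><mf><sg>")))
```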

1

  <rule>
    <remove lemma="règim" tags="n.*">
      <acception lemma="diet" tags="n.*"/>
    </remove>
    <or>
      <skip lemma="nazi" tags="adj.*"/>
      <skip lemma="totalitari" tags="adj.*"/>
      <skip lemma="feixista" tags="adj.*"/>
      <skip lemma="franquista" tags="adj.*"/>
      <skip lemma="militar" tags="adj.*"/>
      <skip lemma="fiscal" tags="adj.*"/>
    </or>
  </rule>

2

  <rule>
    <remove lemma="règim" tags="n.*">
      <acception lemma="diet" tags="n.*"/>
    </remove>
    <or>
      <pattern lemma="nazi" tags="adj.*"/>
      <pattern lemma="totalitari" tags="adj.*"/>
      <pattern lemma="feixista" tags="adj.*"/>
      <pattern lemma="franquista" tags="adj.*"/>
      <pattern lemma="militar" tags="adj.*"/>
      <pattern lemma="fiscal" tags="adj.*"/>
    </or>
  </rule>

3

  <rule>
    <remove-from lemma="règim" tags="n.*">
      <translation lemma="diet" tags="n.*"/>
    </remove-from>
    <pattern>
      <or>
        <pattern-item lemma="nazi" tags="adj.*"/>
        <pattern-item lemma="totalitari" tags="adj.*"/>
        <pattern-item lemma="feixista" tags="adj.*"/>
        <pattern-item lemma="franquista" tags="adj.*"/>
        <pattern-item lemma="militar" tags="adj.*"/>
        <pattern-item lemma="fiscal" tags="adj.*"/>
      </or>
    </pattern>
  </rule>

  <rule c="la dona dels seus somnis">
    <select-for lemma="dona" tags="n.*">
      <translation lemma="wife" tags="n.*"/>
    </select-for>
    <pattern>
      <pattern-item lemma="de" tags="pr.*"/>
      <pattern-item lemma="*" tags="det.pos.*"/>
      <pattern-item lemma="somni" tags="n.*"/>
    </pattern>
  </rule>

4

  <rule>
    <target lemma="règim" tags="n.*">
      <remove lemma="diet" tags="n.*"/>
    </target>
…
  </rule>

  <rule c="la dona dels seus somnis">
    <target lemma="dona" tags="n.*">
      <select lemma="wife" tags="n.*"/>
    </target>
…
  </rule>

Text

s	("estació" n)	("season" n)	(1 "plujós")
s	("estació" n)	("season" n)	(2 "plujós")
s	("estació" n)	("season" n)	(1 "de") (3 "any")
s	("estació" n)	("station" n)	(1 "de") (3 "Línia")
s	("prova" n)	("evidence" n)	(1 "arqueològic")
s	("prova" n)	("test" n)	(1 "estadístic")
s	("prova" n)	("event" n)	(-3 "guanyador") (-2 "de") 
s	("prova" n)	("testing" n)	(-2 "tècnica") (-1 "de") 
s	("joc" n)	("game" n)	(1 "olímpic")
s	("joc" n)	("set" n)	(1 "de") (2 "caràcter")
r	("pista" n)	("hint" n)	(1 "més") (2 "llarg")
r	("pista" n)	("clue" n)	(1 "més") (2 "llarg")
r	("motiu" n)	("motif" n)	(-1 "aquest") (-2 "per")
s	("carn" n)	("flesh" n)	(1 "i") (2 "os")
s	("sobre" pr)	("over" n)	(-1 "victòria")
s       ("dona" n)      ("wife" n)      (-1 "*" det pos)
s       ("dona" n)      ("wife" n)      (-1 "el") (1 "de")
s       ("dona" n)      ("woman" n)     (1 "de") (2 "*" det pos) (3 "somni")
r       ("patró" n)     ("pattern" n)   (1 "*" np ant)
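
A sketch of how lines in this text format might be parsed, under a few assumptions: `s` stands for "select" (as in the comment of the pseudocode further down), `r` is taken to mean "remove", the first two parenthesised groups are the source word and the translation, and the remaining groups are context items with a relative offset. The `TextRule` class and `parse_rule` function are hypothetical names for illustration only.

```python
import re
from dataclasses import dataclass, field

# One parenthesised group: optional signed offset, quoted lemma, optional tags.
GROUP = re.compile(r'\(\s*(-?\d+)?\s*"([^"]*)"\s*([^)]*)\)')
KINDS = {"s": "select", "r": "remove"}  # "r" = remove is an assumption


@dataclass
class TextRule:
    kind: str       # "select" or "remove"
    centre: tuple   # (lemma, [tags]) of the source-language word
    target: tuple   # (lemma, [tags]) of the translation acted on
    context: dict = field(default_factory=dict)  # offset -> (lemma, [tags])


def parse_rule(line):
    """Parse one rule line; fields may be separated by tabs or spaces."""
    kind_char, rest = line.split(None, 1)
    groups = [(off, lemma, tags.split()) for off, lemma, tags in GROUP.findall(rest)]
    if len(groups) < 2:
        raise ValueError("expected a source group and a target group: %r" % line)
    (_, s_lemma, s_tags), (_, t_lemma, t_tags) = groups[0], groups[1]
    context = {int(off): (lemma, tags) for off, lemma, tags in groups[2:]}
    return TextRule(KINDS[kind_char], (s_lemma, s_tags), (t_lemma, t_tags), context)


parsed = parse_rule('s\t("prova" n)\t("event" n)\t(-3 "guanyador") (-2 "de") ')
print(parsed)
```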

Jacob's critique of the rule format

By 'skip' you actually mean 'match'. And lemma selection should be called <select> :-)

So I'd write

<rule>  
  <match lemma="el"/>  
  <match lemma="dona" tags="n.*">  <select lemma="wife"/>  </match>
  <match lemma="de"/>
</rule>

and actually you would like to work on categories as well as lemmas.

Like, to prefer (human) beings and not things before "feel".

<rule>
  <match tags="n.*">  <select cat="beings"/>  </match>
  <match lemma="feel"/>
</rule>

--Jacob Nordfalk 12:22, 30 November 2011 (UTC)

I like this a lot actually!

  <rule>
    <match lemma="estació"> <select lemma="season"/></match>
    <match lemma="plujós" tags="adj.*"/>
  </rule>

It seems much more consistent... Would you allow every match to have a rule, or each rule to have more than one operation? E.g.

  <rule>
    <match lemma="juego"> <select lemma="set"/></match>
    <match lemma="de"/>
    <match lemma="etiqueta"> <select lemma="tag"/></match>
  </rule>

- Francis Tyers 11:31, 1 December 2011 (UTC)

Sergio's idea

<rule>
  <word source-lemma="pista" source-tags="n.*">
    <remove-acception target-lemma="hint" target-tags="n.*"/>
    <remove-acception target-lemma="clue" target-tags="n.*"/>
  </word>
  <word source-lemma="més" source-tags="preadv.*"/>
  <word source-lemma="llarg" source-tags="adj.*"/>
</rule>

(10:44:10) Sergio Ortiz Rojas: my aim is to add to Jacob's idea a way of expressing the bilingual notion in the notation
(10:48:06) Sergio Ortiz Rojas: I think that what Jacob and Felipe prefer is better than the original
(10:48:19) Sergio Ortiz Rojas: the only criticism is that if you leave it like this it will be too "programmatic"
(10:48:19) Sergio Ortiz Rojas: that is
(10:48:24) Sergio Ortiz Rojas: too much for people who write programs
(10:48:26) Sergio Ortiz Rojas: and not very declarative

Old application strategy

The following is an inefficient implementation of the rule application process:


# s	("prova" n)	("event" n)	(-3 "guanyador") (-2 "de") 
#
# tipus = "select";
# centre = "^prova<n>.*"
# tl_patro = ["^event<n>.*"]
# sl_patro = {-3: "^guanyador<", -2: "^de<"}

CLASS Rule: 
        tipus = enum('select', 'remove')
        centre = '';
        tl_patro = [];
        sl_patro = {};


rule_table = {}; # e.g. rule_table["estació"] = [rule1, rule2, rule3];
i = 0

DEFINE ApplyRule(rule, lu): 

    FOREACH target IN lu.tl: 
        SWITCH rule.tipus:
            'select': 
                 IF target NOT IN rule.tl_patro:
                     DELETE target
            'remove': 
                 IF target IN rule.tl_patro:
                     DELETE target

    # Return the filtered lexical unit so the caller can store it back in the sentence.
    RETURN lu




FOREACH pair(sl, tl) IN sentence:
   
    FOREACH centre IN rule_table: 

        IF centre IN sl: 

            FOREACH rule IN rule_table[centre]: 

                # A rule matches only if every one of its context items is found.
                matched = True

                FOREACH context_item IN rule.sl_patro: 

                  IF context_item NOT IN sentence: 
                      matched = False
                      break
                
                # If all of the context items have matched, apply the rule,
                # stop trying further rules, and continue to the next pair. 
                IF matched == True:
 
                      sentence[i] = ApplyRule(rule, sentence[i])
                      break 

    i = i + 1
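
For comparison, here is a runnable Python sketch of the same strategy. It keeps the field names from the pseudocode (`tipus`, `centre`, `tl_patro`, `sl_patro`) and the regular expressions from the comment at the top. Everything else is an assumption for illustration: the sentence is modelled as a list of (source form, list of target forms) pairs, the rules are a flat list rather than a table keyed by centre, and the function names are made up. As in the pseudocode, the first matching rule wins, and a rule may delete every translation.

```python
import re
from dataclasses import dataclass


@dataclass
class Rule:
    tipus: str      # "select" or "remove"
    centre: str     # regex over the source-language lexical form
    tl_patro: list  # regexes over target-language lexical forms
    sl_patro: dict  # relative offset -> regex over a source-language form


def context_matches(rule, sentence, i):
    """True only if every context item matches at its offset from position i."""
    for offset, pattern in rule.sl_patro.items():
        j = i + offset
        if j < 0 or j >= len(sentence) or not re.match(pattern, sentence[j][0]):
            return False
    return True


def apply_rule(rule, targets):
    """Filter the translations: keep the hits for select, drop them for remove."""
    def hit(target):
        return any(re.match(p, target) for p in rule.tl_patro)
    if rule.tipus == "select":
        return [t for t in targets if hit(t)]
    return [t for t in targets if not hit(t)]


def apply_rules(rules, sentence):
    out = []
    for i, (sl, tl) in enumerate(sentence):
        for rule in rules:
            if re.match(rule.centre, sl) and context_matches(rule, sentence, i):
                tl = apply_rule(rule, tl)
                break  # first matching rule wins, then move to the next pair
        out.append((sl, tl))
    return out


rules = [Rule("select", r"^prova<n>.*", [r"^event<n>.*"],
              {-3: r"^guanyador<", -2: r"^de<"})]
sentence = [
    ("guanyador<n><m><sg>", ["winner<n><sg>"]),
    ("de<pr>", ["of<pr>"]),
    ("el<det><def><f><sg>", ["the<det><def><f><sg>"]),
    ("prova<n><f><sg>", ["test<n><sg>", "event<n><sg>", "evidence<n><sg>"]),
]
result = apply_rules(rules, sentence)
print(result[3])
```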

How Brian writes rules

To come up with lexical rules, I use a wide variety of methods. Generally, I begin by thinking of common English words which can be used in several different contexts. Often, I look around the room for inspiration or write down ambiguous English words that I come across in my reading. Using this list of English words, I consult my English-Spanish dictionary (Collins Spanish Concise Dictionary, 6th Edition) and then check online to see whether the word does indeed change in Spanish depending on the context. While checking on the internet, I visit several websites to ensure accuracy. One that I've found to be particularly reliable is www.spanishdict.com, which has an extensive English-Spanish translation dictionary and covers most contexts of any given English word, along with many English idioms.

After I locate a word both online and in my English-Spanish dictionary, I compare the Spanish word each source gives for the English word in a given context. If the resulting word is the same in both, I proceed to writing the rule. If not, I keep checking to try to reach a consensus. If after repeated attempts I can't find the best word, I move on and try a different word. If I do find a consensus word, I move on to writing the rule.

Once I've found the word I want to use in a certain context, I begin writing the rule, taking particular care that it is a lexical selection rule rather than a multiword unit. Writing the rule is likely the easiest part of the process, as there is a template to follow. After I finish writing all of the rules, I first validate the file I'm writing the rules in, and if it validates I copy the rules into the apertium-en-es.en-es.lrx file. Before committing this file to SVN, I check that it validates too.

Usage (old)

$ cat /tmp/test | python apertium-lex-rules.py rules.txt 2>/dev/null
^El<det><def><f><sg>/The<det><def><f><sg>$ 
^estació<n><f><sg>/season<n><sg>$ ^més<preadv>/more<preadv>$ ^plujós<adj><f><sg>/rainy<adj><sint><f><sg>$ 
^ser<vbser><pri><p3><sg>/be<vbser><pri><p3><sg>$ ^el<det><def><f><sg>/the<det><def><f><sg>$ 
^tardor<n><f><sg>/autumn<n><sg>/fall<n><sg>$^,<cm>/,<cm>$ ^i<cnjcoo>/and<cnjcoo>$ ^el<det><def><f><sg>/the<det><def><f><sg>$ 
^més<preadv>/more<preadv>$ ^sec<adj><f><sg>/dry<adj><sint><f><sg>$ ^el<det><def><m><sg>/the<det><def><m><sg>$ 
^estiu<n><m><sg>/summer<n><sg>$ ^.<sent>/.<sent>$

With rules

$ cat /tmp/test | python apertium-lex-rules.py rules.txt | apertium-vm -c ca-en.t1x.vmb | apertium-vm -c ca-en.t2x.vmb |\
   apertium-vm -c ca-en.t3x.vmb | lt-proc -g ca-en.autogen.bin

The 
rainiest season 
is the 
autumn, and the 
driest the 
summer.

With bilingual dictionary defaults

$ cat /tmp/test | apertium-lex-defaults ca-en.autoldx.bin | apertium-vm -c ca-en.t1x.vmb | apertium-vm -c ca-en.t2x.vmb |\
   apertium-vm -c ca-en.t3x.vmb | lt-proc -g ca-en.autogen.bin

The 
rainiest station 
is the 
autumn, and the 
driest the 
summer.

Real training program

Depends on lttoolbox + IRSTLM
Input: biltrans output + biltrans-to-multitrans + translation
Process:
- Read biltrans output, make list of ambig words
- Translate + normalise + count.
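
A small sketch of the first processing step only (listing the ambiguous words), assuming the biltrans output uses the `^source/target1/target2$` layout shown in the usage examples above and contains no escaped slashes. The function name `ambiguous_words` is made up for this example; the translate, normalise and count steps are not covered.

```python
import re
from collections import Counter

LEXICAL_UNIT = re.compile(r"\^(.*?)\$")  # one ^...$ unit of the stream


def ambiguous_words(biltrans_text):
    """Count source forms that come with more than one translation."""
    counts = Counter()
    for unit in LEXICAL_UNIT.findall(biltrans_text):
        source, *targets = unit.split("/")  # assumes no escaped "/" in the unit
        if len(targets) > 1:
            counts[source] += 1
    return counts


sample = ("^tardor<n><f><sg>/autumn<n><sg>/fall<n><sg>$^,<cm>/,<cm>$ "
          "^i<cnjcoo>/and<cnjcoo>$")
ambiguous = ambiguous_words(sample)
print(dict(ambiguous))
```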

Idea for scanning


<rule>
  <match lemma="foo"/>
  <any>
    <match tags="adv"/>
  </any>
  <match lemma="bar"><select lemma="баз"/></match>
</rule>

Would make "foo adv* bar:баз"


<rule>
  <match lemma="foo"/>
  <any>
    <or><match tags="adv"/><match tags="adj"/></or>
  </any>
  <match lemma="bar"><select lemma="баз"/></match>
</rule>

Would make "foo (adv|adj)* bar:баз"
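
The two examples above can be reproduced with a short sketch that walks a rule and prints the pattern it "would make". It assumes only the elements used in this section (`rule`, `match`, `any`, `or`, `select`). The function names are hypothetical, and the output is just the informal notation used above, not a compiled transducer.

```python
import xml.etree.ElementTree as ET


def node_to_pattern(node):
    """Render one rule element in the informal 'foo adv* bar:x' notation."""
    if node.tag == "match":
        label = node.get("lemma") or node.get("tags") or "*"
        select = node.find("select")
        return label if select is None else "%s:%s" % (label, select.get("lemma"))
    if node.tag == "or":
        return "(" + "|".join(node_to_pattern(child) for child in node) + ")"
    if node.tag == "any":
        inner = " ".join(node_to_pattern(child) for child in node)
        if len(node) > 1:  # group a sequence before repeating it
            inner = "(" + inner + ")"
        return inner + "*"
    raise ValueError("unexpected element <%s>" % node.tag)


def rule_to_pattern(rule_xml):
    rule = ET.fromstring(rule_xml)
    return " ".join(node_to_pattern(child) for child in rule)


RULE_ADV = """<rule>
  <match lemma="foo"/>
  <any>
    <match tags="adv"/>
  </any>
  <match lemma="bar"><select lemma="баз"/></match>
</rule>"""

RULE_ADV_ADJ = """<rule>
  <match lemma="foo"/>
  <any>
    <or><match tags="adv"/><match tags="adj"/></or>
  </any>
  <match lemma="bar"><select lemma="баз"/></match>
</rule>"""

print(rule_to_pattern(RULE_ADV))
print(rule_to_pattern(RULE_ADV_ADJ))
```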
