Talk:Constraint-based lexical selection module

Latest revision as of 14:10, 19 August 2014

Lists

Here is a suggestion of how lists should work.

The list would be interpreted as a sequence of "match" tags, i.e. when we compile, we just expand each "list-item" into a "match". This means that if we want it to behave as a list in the normal sense (any one item may match), we need to put it in an "or". So it could also be used as a kind of template. But I'm not sure whether this is overloading the construct.


<rules>

  <lists>
    <list n="cumplir-n">
      <list-item lemma="urte"/>
      <list-item lemma="agindu"/>
      <list-item lemma="amets"/>
      <list-item lemma="arau"/>
      <list-item lemma="asmo"/>
      <list-item lemma="baldintza"/>
      <list-item lemma="esan" tags="n"/>
      <list-item lemma="eskubide"/>
      <list-item lemma="helburu"/>
      <list-item lemma="hitz"/>
      <list-item lemma="irizpide"/>
      <list-item lemma="lan"/>
      <list-item lemma="lege"/>
      <list-item lemma="obligazio"/>
      <list-item lemma="proposamen"/>
      <list-item lemma="erregelamendu"/>
    </list>
  </lists>

  <rule>
    <or>
      <match list="cumplir-n"/>
    </or>
    <match lemma="ez"/>
    <match lemma="izan"/>
    <match lemma="bete"><select lemma="cumplir"/></match>
  </rule><!--SELECT ("cumplir" n) IF (-3 ("urte") OR CUMPLIR-N LINK 0 abs)(-2 ("ez"))(-1 ("izan")); -->


  <rule>
    <or>
      <match list="cumplir-n"/>
    </or>
    <match lemma="ez"/>
    <match lemma="izan"/>
    <match lemma="la"/>
    <match lemma="bete"><select lemma="cumplir"/></match>
  </rule><!--SELECT ("cumplir" n) IF (-3 ("urte") OR CUMPLIR-N LINK 0 abs)(-2 ("ez"))(-1 ("izan")); -->


</rules>
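
The expansion step suggested above could be sketched roughly as follows. This is only an illustration under assumptions: the function name expand_lists and the shortened SAMPLE file are hypothetical, and the real compiler may handle lists quite differently. It replaces each match that refers to a list with one match per list-item, so wrapping the reference in an "or" gives the usual "any of these" reading.

```python
import xml.etree.ElementTree as ET


def expand_lists(rules_xml: str) -> ET.Element:
    """Replace every <match list="NAME"/> with one <match> per <list-item>."""
    root = ET.fromstring(rules_xml)
    # Collect the attributes of each list's items, keyed by list name.
    lists = {
        lst.get("n"): [dict(item.attrib) for item in lst.findall("list-item")]
        for lst in root.iter("list")
    }
    for parent in list(root.iter()):
        children = list(parent)
        if not any(c.tag == "match" and c.get("list") for c in children):
            continue
        # Rebuild the child sequence, keeping the original order.
        for child in children:
            parent.remove(child)
        for child in children:
            name = child.get("list") if child.tag == "match" else None
            if name is None:
                parent.append(child)
            else:
                for attrs in lists[name]:
                    parent.append(ET.Element("match", attrs))
    # The <lists> block is not needed once the references are expanded.
    for lists_block in root.findall("lists"):
        root.remove(lists_block)
    return root


# Hypothetical, shortened version of the rule file above.
SAMPLE = """
<rules>
  <lists>
    <list n="cumplir-n">
      <list-item lemma="urte"/>
      <list-item lemma="esan" tags="n"/>
    </list>
  </lists>
  <rule>
    <or><match list="cumplir-n"/></or>
    <match lemma="ez"/>
    <match lemma="bete"><select lemma="cumplir"/></match>
  </rule>
</rules>
"""

if __name__ == "__main__":
    expanded = expand_lists(SAMPLE)
    print(ET.tostring(expanded, encoding="unicode"))
```

Without the surrounding "or", the same expansion would yield a plain sequence of matches, which is the "template" reading mentioned above.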

Rule formats

Note: Felipe doesn't like "skip".

can't say I do either … sounds like a command rather than a constraint --unhammer 11:42, 18 November 2011 (UTC)
also, in the <or>, should we read them as independent of each other? that's a bit confusing since otherwise they're all required and have a certain order --unhammer 11:42, 18 November 2011 (UTC)

The regular expression for the OR below is:

(nazi<adj>[0-9A-Za-z <>]*|totalitari<adj>[0-9A-Za-z <>]*|feixista<adj>[0-9A-Za-z <>]*|franquista<adj>[0-9A-Za-z <>]*|militar<adj>[0-9A-Za-z <>]*|fiscal<adj>[0-9A-Za-z <>]*)

- Francis Tyers 11:53, 18 November 2011 (UTC)
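
As a rough sketch of how such an expression could be generated from the items of an "or" block (the function names are hypothetical, and it assumes that a trailing "*" in the tags attribute stands for "any further tags"):

```python
import re

# Character class used above for "any remaining tags".
TAG_TAIL = "[0-9A-Za-z <>]*"


def tags_to_regex(tags: str) -> str:
    """Turn a dotted tag pattern such as "adj.*" into "<adj>" plus the tail."""
    out = ""
    for part in tags.split("."):
        out += TAG_TAIL if part == "*" else "<" + part + ">"
    return out


def or_to_regex(items) -> str:
    """Build one alternation from (lemma, tags) pairs."""
    alternatives = [re.escape(lemma) + tags_to_regex(tags) for lemma, tags in items]
    return "(" + "|".join(alternatives) + ")"


ADJECTIVES = ["nazi", "totalitari", "feixista", "franquista", "militar", "fiscal"]

if __name__ == "__main__":
    pattern = or_to_regex([(lemma, "adj.*") for lemma in ADJECTIVES])
    print(pattern)
    print(bool(re.fullmatch(pattern, "militar<adj><mf><sg>")))
```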

1
  <rule>
    <remove lemma="règim" tags="n.*">
      <acception lemma="diet" tags="n.*"/>
    </remove>
    <or>
      <skip lemma="nazi" tags="adj.*"/>
      <skip lemma="totalitari" tags="adj.*"/>
      <skip lemma="feixista" tags="adj.*"/>
      <skip lemma="franquista" tags="adj.*"/>
      <skip lemma="militar" tags="adj.*"/>
      <skip lemma="fiscal" tags="adj.*"/>
    </or>
  </rule>
2
  <rule>
    <remove lemma="règim" tags="n.*">
      <acception lemma="diet" tags="n.*"/>
    </remove>
    <or>
      <pattern lemma="nazi" tags="adj.*"/>
      <pattern lemma="totalitari" tags="adj.*"/>
      <pattern lemma="feixista" tags="adj.*"/>
      <pattern lemma="franquista" tags="adj.*"/>
      <pattern lemma="militar" tags="adj.*"/>
      <pattern lemma="fiscal" tags="adj.*"/>
    </or>
  </rule>
3
  <rule>
    <remove-from lemma="règim" tags="n.*">
      <translation lemma="diet" tags="n.*"/>
    </remove-from>
    <pattern>
      <or>
        <pattern-item lemma="nazi" tags="adj.*"/>
        <pattern-item lemma="totalitari" tags="adj.*"/>
        <pattern-item lemma="feixista" tags="adj.*"/>
        <pattern-item lemma="franquista" tags="adj.*"/>
        <pattern-item lemma="militar" tags="adj.*"/>
        <pattern-item lemma="fiscal" tags="adj.*"/>
      </or>
    </pattern>
  </rule>

  <rule c="la dona dels seus somnis">
    <select-for lemma="dona" tags="n.*">
      <translation lemma="wife" tags="n.*"/>
    </select-for>
    <pattern>
      <pattern-item lemma="de" tags="pr.*"/>
      <pattern-item lemma="*" tags="det.pos.*"/>
      <pattern-item lemma="somni" tags="n.*"/>
    </pattern>
  </rule>

4
  <rule>
    <target lemma="règim" tags="n.*">
      <remove lemma="diet" tags="n.*"/>
    </target>
…
  </rule>

  <rule c="la dona dels seus somnis">
    <target lemma="dona" tags="n.*">
      <select lemma="wife" tags="n.*"/>
    </target>
…
  </rule>


Text

s	("estació" n)	("season" n)	(1 "plujós")
s	("estació" n)	("season" n)	(2 "plujós")
s	("estació" n)	("season" n)	(1 "de") (3 "any")
s	("estació" n)	("station" n)	(1 "de") (3 "Línia")
s	("prova" n)	("evidence" n)	(1 "arqueològic")
s	("prova" n)	("test" n)	(1 "estadístic")
s	("prova" n)	("event" n)	(-3 "guanyador") (-2 "de") 
s	("prova" n)	("testing" n)	(-2 "tècnica") (-1 "de") 
s	("joc" n)	("game" n)	(1 "olímpic")
s	("joc" n)	("set" n)	(1 "de") (2 "caràcter")
r	("pista" n)	("hint" n)	(1 "més") (2 "llarg")
r	("pista" n)	("clue" n)	(1 "més") (2 "llarg")
r	("motiu" n)	("motif" n)	(-1 "aquest") (-2 "per")
s	("carn" n)	("flesh" n)	(1 "i") (2 "os")
s	("sobre" pr)	("over" n)	(-1 "victòria")
s       ("dona" n)      ("wife" n)      (-1 "*" det pos)
s       ("dona" n)      ("wife" n)      (-1 "el") (1 "de")
s       ("dona" n)      ("woman" n)     (1 "de") (2 "*" det pos) (3 "somni")
r       ("patró" n)     ("pattern" n)   (1 "*" np ant)
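
A minimal sketch of a reader for this text format, assuming that "s" means select and "r" means remove, that the first two parenthesised groups are the source and target words, and that every later group is a context item at a relative position. The names Word, TextRule and parse_rule are made up for the example.

```python
import re
from typing import NamedTuple


class Word(NamedTuple):
    lemma: str
    tags: tuple


class TextRule(NamedTuple):
    op: str        # "s" (select) or "r" (remove)
    source: Word
    target: Word
    context: dict  # relative position -> Word


GROUP = re.compile(r"\(([^()]*)\)")
ITEM = re.compile(r'^(?:(-?\d+)\s+)?"([^"]*)"\s*(.*)$')


def parse_rule(line: str) -> TextRule:
    """Parse one line such as: s ("estació" n) ("season" n) (1 "plujós")."""
    op = line.split(None, 1)[0]
    items = []
    for group in GROUP.findall(line):
        m = ITEM.match(group.strip())
        if m is None:
            raise ValueError("cannot parse group: " + group)
        offset = int(m.group(1)) if m.group(1) is not None else None
        items.append((offset, Word(m.group(2), tuple(m.group(3).split()))))
    if len(items) < 2:
        raise ValueError("a rule needs a source and a target word")
    (_, source), (_, target), *context_items = items
    context = {offset: word for offset, word in context_items}
    return TextRule(op, source, target, context)


if __name__ == "__main__":
    print(parse_rule('s\t("dona" n)\t("woman" n)\t(1 "de") (2 "*" det pos) (3 "somni")'))
```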

Jacob's critique of the rule format

By 'skip' you actually mean 'match'. And lemma selection should be called <select> :-)

So I'd write

<rule>  
  <match lemma="el"/>  
  <match lemma="dona" tags="n.*">  <select lemma="wife"/>  </match>
  <match lemma="de"/>
</rule>

and actually you would like to work on categories as well as lemmas.

Like, to prefer (human) beings and not things before "feel".

<rule>
  <match tags="n.*">  <select cat="beings"/>  </match>
  <match lemma="feel"/>
</rule>

--Jacob Nordfalk 12:22, 30 November 2011 (UTC)


I like this a lot actually!
  <rule>
    <match lemma="estació"> <select lemma="season"/></match>
    <match lemma="plujós" tags="adj.*"/>
  </rule>

It seems much more consistent... Would you allow every match to have a rule, or each rule to have more than one operation? E.g.

  <rule>
    <match lemma="juego"> <select lemma="set"/></match>
    <match lemma="de"/>
    <match lemma="etiqueta"> <select lemma="tag"/></match>
  </rule>

- Francis Tyers 11:31, 1 December 2011 (UTC)

Sergio's idea

<rule>
  <word source-lemma="pista" source-tags="n.*">
    <remove-acception target-lemma="hint" target-tags="n.*"/>
    <remove-acception target-lemma="clue" target-tags="n.*"/>
  </word>
  <word source-lemma="més" source-tags="preadv.*"/>
  <word source-lemma="llarg" source-tags="adj.*"/>
</rule>

(10:44:10) Sergio Ortiz Rojas: my goal is to add to Jacob's idea a way of expressing the bilingual aspect in the notation
(10:48:06) Sergio Ortiz Rojas: I think what Jacob and Felipe prefer is better than the original
(10:48:19) Sergio Ortiz Rojas: my only criticism is that if you leave it like that it will be too "programmatic"
(10:48:19) Sergio Ortiz Rojas: that is
(10:48:24) Sergio Ortiz Rojas: too much for people who write programs
(10:48:26) Sergio Ortiz Rojas: and not declarative enough


Old application strategy

The following is an inefficient implementation of the rule application process:


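import re
from dataclasses import dataclass, field

# s	("prova" n)	("event" n)	(-3 "guanyador") (-2 "de")
#
# tipus = "select"
# centre = "^prova<n>.*"
# tl_patro = ["^event<n>.*"]
# sl_patro = {-3: "^guanyador<", -2: "^de<"}

@dataclass
class Rule:
    tipus: str                                    # 'select' or 'remove'
    centre: str                                   # regex over the source word the rule applies to
    tl_patro: list = field(default_factory=list)  # regexes over the target translations
    sl_patro: dict = field(default_factory=dict)  # relative position -> regex over a source word

@dataclass
class LexicalUnit:
    sl: str                                       # source reading, e.g. "prova<n><f><sg>"
    tl: list = field(default_factory=list)        # candidate translations

def apply_rule(rule, lu):
    """'select' keeps only the targets matching tl_patro; 'remove' deletes them."""
    def in_tl_patro(target):
        return any(re.match(patro, target) for patro in rule.tl_patro)

    if rule.tipus == 'select':
        lu.tl = [t for t in lu.tl if in_tl_patro(t)]
    elif rule.tipus == 'remove':
        lu.tl = [t for t in lu.tl if not in_tl_patro(t)]
    return lu

def context_matches(rule, sentence, i):
    """True only if every context item matches at its position relative to i."""
    for offset, patro in rule.sl_patro.items():
        j = i + offset
        if j < 0 or j >= len(sentence) or not re.match(patro, sentence[j].sl):
            return False
    return True

def apply_rules(rule_table, sentence):
    # rule_table maps a centre to its rules, e.g. rule_table["estació"] = [rule1, rule2, rule3]
    for i, lu in enumerate(sentence):
        for centre, rules in rule_table.items():
            if centre not in lu.sl:
                continue
            for rule in rules:
                # Apply the first rule whose centre and whole context match,
                # then stop trying further rules for this centre.
                if re.match(rule.centre, lu.sl) and context_matches(rule, sentence, i):
                    sentence[i] = apply_rule(rule, lu)
                    break
    return sentence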

How Brian writes rules

To come up with lexical rules, I use a wide variety of methods. Generally, I begin by thinking of common English words that can be used in several different contexts. Often, I look around the room for inspiration or write down ambiguous English words that I come across in my reading. Using this list of English words, I consult my English-Spanish dictionary (Collins Spanish Concise Dictionary, 6th Edition) and then check online to see whether the word does indeed change in Spanish depending on the context. While checking on the internet, I visit several websites to ensure accuracy. One that I've found to be particularly reliable is www.spanishdict.com, which has an extensive English-Spanish translation dictionary and covers most contexts of any given English word, along with many English idioms.


After I locate a word both online and in my English-Spanish dictionary, I compare the Spanish word each gives for an English word in a given context. If the resulting word is the same in both, I proceed to actually writing the rule. If not, I continue checking to try to reach a consensus. If after repeated attempts I can't find the best word, I move on and try a different word. If I am able to find a consensus word, I move on to writing the rule.


Once I've found the word I want to use in a certain context, I begin writing the rule, paying particular attention to make sure it's a lexical selection rule rather than a multiword unit. Writing the rule is likely the easiest part of the process, as there is a template to follow. After I finish writing all of the rules, I first validate the file I'm writing the rules in, and if it validates, I copy the rules into the apertium-en-es.en-es.lrx file. Before committing this file to SVN, I check that it too validates.

Usage (old)

$ cat /tmp/test | python apertium-lex-rules.py rules.txt 2>/dev/null
^El<det><def><f><sg>/The<det><def><f><sg>$ 
^estació<n><f><sg>/season<n><sg>$ ^més<preadv>/more<preadv>$ ^plujós<adj><f><sg>/rainy<adj><sint><f><sg>$ 
^ser<vbser><pri><p3><sg>/be<vbser><pri><p3><sg>$ ^el<det><def><f><sg>/the<det><def><f><sg>$ 
^tardor<n><f><sg>/autumn<n><sg>/fall<n><sg>$^,<cm>/,<cm>$ ^i<cnjcoo>/and<cnjcoo>$ ^el<det><def><f><sg>/the<det><def><f><sg>$ 
^més<preadv>/more<preadv>$ ^sec<adj><f><sg>/dry<adj><sint><f><sg>$ ^el<det><def><m><sg>/the<det><def><m><sg>$ 
^estiu<n><m><sg>/summer<n><sg>$ ^.<sent>/.<sent>$ 
With rules
$ cat /tmp/test | python apertium-lex-rules.py rules.txt | apertium-vm -c ca-en.t1x.vmb | apertium-vm -c ca-en.t2x.vmb |\
   apertium-vm -c ca-en.t3x.vmb | lt-proc -g ca-en.autogen.bin

The 
rainiest season 
is the 
autumn, and the 
driest the 
summer. 
With bilingual dictionary defaults
$ cat /tmp/test | apertium-lex-defaults ca-en.autoldx.bin | apertium-vm -c ca-en.t1x.vmb | apertium-vm -c ca-en.t2x.vmb |\
   apertium-vm -c ca-en.t3x.vmb | lt-proc -g ca-en.autogen.bin

The 
rainiest station 
is the 
autumn, and the 
driest the 
summer.

Real training program

  • Depends on lttoolbox + IRSTLM
  • Input: biltrans output + biltrans-to-multitrans + translation
  • Process:
    • Read biltrans output, make a list of ambiguous words
    • Translate + normalise + count.
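
The first processing step (reading biltrans output and listing the ambiguous words) might look something like the sketch below. It assumes the stream format shown under "Usage (old)", where each unit is written ^source/target1/target2$, and it ignores escaped slashes. The function name is hypothetical, and the translate, normalise and count steps with IRSTLM are not covered.

```python
import re
from collections import Counter

# One lexical unit in the stream: ^source/target1/target2$
LEXICAL_UNIT = re.compile(r"\^(.*?)\$")


def ambiguous_words(biltrans: str) -> Counter:
    """Count source readings that have more than one candidate translation."""
    counts = Counter()
    for unit in LEXICAL_UNIT.findall(biltrans):
        source, *targets = unit.split("/")
        if len(targets) > 1:
            counts[source] += 1
    return counts


SAMPLE = (
    "^estació<n><f><sg>/season<n><sg>$ ^més<preadv>/more<preadv>$\n"
    "^tardor<n><f><sg>/autumn<n><sg>/fall<n><sg>$^,<cm>/,<cm>$ ^i<cnjcoo>/and<cnjcoo>$\n"
)

if __name__ == "__main__":
    for source, count in ambiguous_words(SAMPLE).items():
        print(count, source)
```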

Idea for scanning


<rule>
  <match lemma="foo"/>
  <any>
    <match tags="adv"/>
  </any>
  <match lemma="bar"><select lemma="баз"/></match>
</rule>

Would make "foo adv* bar:баз"


<rule>
  <match lemma="foo"/>
  <any>
    <or><match tags="adv"/><match tags="adj"/></or>
  </any>
  <match lemma="bar"><select lemma="баз"/></match>
</rule>


Would make "foo (adv|adj)* bar:баз"
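
A rough sketch of how a rule could be turned into such a pattern string. The function names are hypothetical, and it assumes a match is written as its lemma (or its tags when no lemma is given), a select is appended after a colon, "or" becomes an alternation and "any" becomes a star.

```python
import xml.etree.ElementTree as ET


def node_to_pattern(node: ET.Element) -> str:
    """Render one element of a rule as a piece of the scanning pattern."""
    if node.tag == "match":
        label = node.get("lemma") or node.get("tags") or "*"
        select = node.find("select")
        if select is not None:
            label += ":" + select.get("lemma")
        return label
    if node.tag == "or":
        return "(" + "|".join(node_to_pattern(child) for child in node) + ")"
    if node.tag == "any":
        children = list(node)
        inner = " ".join(node_to_pattern(child) for child in children)
        # Parenthesise only when the star has to cover several items.
        return ("(" + inner + ")*") if len(children) > 1 else inner + "*"
    raise ValueError("unexpected element: " + node.tag)


def rule_to_pattern(rule_xml: str) -> str:
    rule = ET.fromstring(rule_xml)
    return " ".join(node_to_pattern(child) for child in rule)


RULE_ADV = """
<rule>
  <match lemma="foo"/>
  <any>
    <match tags="adv"/>
  </any>
  <match lemma="bar"><select lemma="баз"/></match>
</rule>
"""

RULE_ADV_ADJ = """
<rule>
  <match lemma="foo"/>
  <any>
    <or><match tags="adv"/><match tags="adj"/></or>
  </any>
  <match lemma="bar"><select lemma="баз"/></match>
</rule>
"""

if __name__ == "__main__":
    print(rule_to_pattern(RULE_ADV))
    print(rule_to_pattern(RULE_ADV_ADJ))
```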