Talk:Constraint-based lexical selection module
==Work plan (12th March 2012)==

# Learn rules from a corpus where the best translation is selected by the TLM
# Try changing the parameters of the LM
## Remove stopwords
## Rank on lemmata
# Try neighbouring word permutations, and ranking on LM
## Remove stopwords
## Rank on lemmata
# Run tests on eu-es

{|class=wikitable
! !! LER !! WER !! BLEU
|-
| TLM (s5) || 248/603 41.1% || ||
|-
| TLM (l5) || 211/603 34.9% || ||
|-
| TLM (n5) || 271/603 44.9% || ||
|-
| TLM (l5+flip2) || 220/603 36.4% || ||
|-
| TLM (ptes-s5) || -/603 48.4% || ||
|-
| TLM (fres-s5) || 280/603 46.4% || ||
|-
|}

Programs to make:

# Program that takes biltrans output as input, generates all possible combinations, translates them, scores them with the language model, takes the biltrans output with the highest LM score, and compares it against a reference biltrans output.
Latest revision as of 14:10, 19 August 2014
Lists
Here is a suggestion of how lists should work.
The list would be interpreted as a sequence of "match" tags, e.g. when we compile, we just expand "list-items" to "match"s. This means that if we want it to be a list in the normal sense, we need to put it in an "or". So, it could also be used as a kind of templatey thing. But I'm not sure if this is overloading.
<pre>
<rules>
  <lists>
    <list n="cumplir-n">
      <list-item lemma="urte"/>
      <list-item lemma="agindu"/>
      <list-item lemma="amets"/>
      <list-item lemma="arau"/>
      <list-item lemma="asmo"/>
      <list-item lemma="baldintza"/>
      <list-item lemma="esan" tags="n"/>
      <list-item lemma="eskubide"/>
      <list-item lemma="helburu"/>
      <list-item lemma="hitz"/>
      <list-item lemma="irizpide"/>
      <list-item lemma="lan"/>
      <list-item lemma="lege"/>
      <list-item lemma="obligazio"/>
      <list-item lemma="proposamen"/>
      <list-item lemma="erregelamendu"/>
    </list>
  </lists>
  <rule>
    <or>
      <match list="cumplir-n"/>
    </or>
    <match lemma="ez"/>
    <match lemma="izan"/>
    <match lemma="bete"><select lemma="cumplir"/></match>
  </rule><!--SELECT ("cumplir" n) IF (-3 ("urte") OR CUMPLIR-N LINK 0 abs)(-2 ("ez"))(-1 ("izan")); -->
  <rule>
    <or>
      <match list="cumplir-n"/>
    </or>
    <match lemma="ez"/>
    <match lemma="izan"/>
    <match lemma="la"/>
    <match lemma="bete"><select lemma="cumplir"/></match>
  </rule><!--SELECT ("cumplir" n) IF (-3 ("urte") OR CUMPLIR-N LINK 0 abs)(-2 ("ez"))(-1 ("izan")); -->
</rules>
</pre>
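The expansion step suggested above could be sketched roughly as follows. This is only an illustration of the idea, not the module's actual compiler: the function name `expand_lists` is made up here, and it assumes the rules file is well-formed XML in which every `<match list="..."/>` refers to a `<list n="...">` defined under `<lists>`.

```python
import xml.etree.ElementTree as ET


def expand_lists(rules_xml: str) -> ET.Element:
    """Replace every <match list="X"/> by one <match> per item of list X.

    Hypothetical helper: the real compiler may do this differently.
    """
    root = ET.fromstring(rules_xml)

    # Collect the attributes of each list item, keyed by list name.
    lists = {
        lst.get("n"): [dict(item.attrib) for item in lst.iter("list-item")]
        for lst in root.iter("list")
    }

    # Rebuild the children of every element, expanding list references.
    for parent in list(root.iter()):
        new_children = []
        for child in list(parent):
            name = child.get("list") if child.tag == "match" else None
            if name is None:
                new_children.append(child)
            else:
                new_children.extend(ET.Element("match", attrs) for attrs in lists[name])
        for child in list(parent):
            parent.remove(child)
        parent.extend(new_children)

    # The <lists> block is no longer needed once everything is expanded.
    for lists_el in root.findall("lists"):
        root.remove(lists_el)
    return root
```

With this reading, a list placed inside an `<or>` becomes a plain alternation, while a list placed directly in a rule becomes a sequence of required matches, which is the "templatey" use mentioned above.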
Rule formats
Note: Felipe doesn't like "skip".
- can't say I do either … sounds like a command rather than a constraint --unhammer 11:42, 18 November 2011 (UTC)
- also, in the <or>, should we read them as independent of each other? that's a bit confusing since otherwise they're all required and have a certain order --unhammer 11:42, 18 November 2011 (UTC)
The regular expression for the OR below is:
(nazi<adj>[0-9A-Za-z <>]*|totalitari<adj>[0-9A-Za-z <>]*|feixista<adj>[0-9A-Za-z <>]*|franquista<adj>[0-9A-Za-z <>]*|militar<adj>[0-9A-Za-z <>]*|fiscal<adj>[0-9A-Za-z <>]*)
- Francis Tyers 11:53, 18 November 2011 (UTC)
- 1
<pre>
<rule>
  <remove lemma="règim" tags="n.*">
    <acception lemma="diet" tags="n.*"/>
  </remove>
  <or>
    <skip lemma="nazi" tags="adj.*"/>
    <skip lemma="totalitari" tags="adj.*"/>
    <skip lemma="feixista" tags="adj.*"/>
    <skip lemma="franquista" tags="adj.*"/>
    <skip lemma="militar" tags="adj.*"/>
    <skip lemma="fiscal" tags="adj.*"/>
  </or>
</rule>
</pre>
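As a rough illustration of how the regular expression quoted above could be produced from the items of an `<or>`, here is a small sketch. The function name `or_regex` and its parameters are assumptions made for this example; only the shape of the resulting expression comes from the discussion.

```python
import re


def or_regex(lemmas, tag="adj"):
    """Build one alternation that matches any of the lemmas with the given first tag."""
    # Same tail as in the expression quoted on this page: any further tags or characters.
    tail = "[0-9A-Za-z <>]*"
    return "(" + "|".join(f"{re.escape(lemma)}<{tag}>{tail}" for lemma in lemmas) + ")"


pattern = or_regex(["nazi", "totalitari", "feixista", "franquista", "militar", "fiscal"])
```

Because the alternatives are joined with `|`, they are independent of each other and unordered, which is relevant to the question raised above about how the contents of `<or>` should be read.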
- 2
<pre>
<rule>
  <remove lemma="règim" tags="n.*">
    <acception lemma="diet" tags="n.*"/>
  </remove>
  <or>
    <pattern lemma="nazi" tags="adj.*"/>
    <pattern lemma="totalitari" tags="adj.*"/>
    <pattern lemma="feixista" tags="adj.*"/>
    <pattern lemma="franquista" tags="adj.*"/>
    <pattern lemma="militar" tags="adj.*"/>
    <pattern lemma="fiscal" tags="adj.*"/>
  </or>
</rule>
</pre>
- 3
<pre>
<rule>
  <remove-from lemma="règim" tags="n.*">
    <translation lemma="diet" tags="n.*"/>
  </remove-from>
  <pattern>
    <or>
      <pattern-item lemma="nazi" tags="adj.*"/>
      <pattern-item lemma="totalitari" tags="adj.*"/>
      <pattern-item lemma="feixista" tags="adj.*"/>
      <pattern-item lemma="franquista" tags="adj.*"/>
      <pattern-item lemma="militar" tags="adj.*"/>
      <pattern-item lemma="fiscal" tags="adj.*"/>
    </or>
  </pattern>
</rule>

<rule c="la dona dels seus somnis">
  <select-for lemma="dona" tags="n.*">
    <translation lemma="wife" tags="n.*"/>
  </select-for>
  <pattern>
    <pattern-item lemma="de" tags="pr.*"/>
    <pattern-item lemma="*" tags="det.pos.*"/>
    <pattern-item lemma="somni" tags="n.*"/>
  </pattern>
</rule>
</pre>
- 4
<pre>
<rule>
  <target lemma="règim" tags="n.*">
    <remove lemma="diet" tags="n.*"/>
  </target>
  …
</rule>

<rule c="la dona dels seus somnis">
  <target lemma="dona" tags="n.*">
    <select lemma="wife" tags="n.*"/>
  </target>
  …
</rule>
</pre>
Text
<pre>
s ("estació" n) ("season" n) (1 "plujós")
s ("estació" n) ("season" n) (2 "plujós")
s ("estació" n) ("season" n) (1 "de") (3 "any")
s ("estació" n) ("station" n) (1 "de") (3 "Línia")
s ("prova" n) ("evidence" n) (1 "arqueològic")
s ("prova" n) ("test" n) (1 "estadístic")
s ("prova" n) ("event" n) (-3 "guanyador") (-2 "de")
s ("prova" n) ("testing" n) (-2 "tècnica") (-1 "de")
s ("joc" n) ("game" n) (1 "olímpic")
s ("joc" n) ("set" n) (1 "de") (2 "caràcter")
r ("pista" n) ("hint" n) (1 "més") (2 "llarg")
r ("pista" n) ("clue" n) (1 "més") (2 "llarg")
r ("motiu" n) ("motif" n) (-1 "aquest") (-2 "per")
s ("carn" n) ("flesh" n) (1 "i") (2 "os")
s ("sobre" pr) ("over" n) (-1 "victòria")
s ("dona" n) ("wife" n) (-1 "*" det pos)
s ("dona" n) ("wife" n) (-1 "el") (1 "de")
s ("dona" n) ("woman" n) (1 "de") (2 "*" det pos) (3 "somni")
r ("patró" n) ("pattern" n) (1 "*" np ant)
</pre>
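A line in this text format could be read with something like the sketch below. It is a guess at the grammar based only on the examples above: `s` is taken to mean "select" and `r` "remove", lemmas are assumed not to contain spaces or parentheses, and the names `TextRule` and `parse_rule` are invented for the example.

```python
import re
from dataclasses import dataclass, field

# One parenthesised group, e.g. ("prova" n) or (-3 "guanyador").
GROUP = re.compile(r"\(([^()]*)\)")


@dataclass
class TextRule:
    kind: str                  # "select" or "remove"
    source: tuple              # (lemma, tags) of the source-language word
    target: tuple              # (lemma, tags) of the translation acted on
    context: dict = field(default_factory=dict)  # offset -> (lemma, tags)


def _lemma_tags(tokens):
    """Split ['"dona"', 'n'] into ('dona', ('n',))."""
    return tokens[0].strip('"'), tuple(tokens[1:])


def parse_rule(line: str) -> TextRule:
    kind_code, rest = line.split(None, 1)
    kind = {"s": "select", "r": "remove"}[kind_code]  # assumed meaning of s / r
    groups = [g.split() for g in GROUP.findall(rest)]
    source = _lemma_tags(groups[0])
    target = _lemma_tags(groups[1])
    # Remaining groups are context items: a relative position, a lemma, optional tags.
    context = {int(g[0]): _lemma_tags(g[1:]) for g in groups[2:]}
    return TextRule(kind, source, target, context)
```

For example, `parse_rule('s ("dona" n) ("wife" n) (-1 "*" det pos)')` gives a select rule whose only context item sits at offset -1 with lemma `*` and tags `det pos`.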
Jacob's critique of the rule format
By 'skip' you actually mean 'match'. And lemma selection should be called <select> :-)
So I'd write
<pre>
<rule>
  <match lemma="el"/>
  <match lemma="dona" tags="n.*">
    <select lemma="wife"/>
  </match>
  <match lemma="de"/>
</rule>
</pre>
and actually you would like to work on categories as well as lemmas.
Like, to prefer (human) beings and not things before "feel".
<pre>
<rule>
  <match tags="n.*">
    <select cat="beings"/>
  </match>
  <match lemma="feel"/>
</rule>
</pre>
--Jacob Nordfalk 12:22, 30 November 2011 (UTC)
- I like this a lot actually!
<pre>
<rule>
  <match lemma="estació"><select lemma="season"/></match>
  <match lemma="plujós" tags="adj.*"/>
</rule>
</pre>
It seems much more consistent... Would you allow every match to have a rule, or each rule to have more than one operation? E.g.
<pre>
<rule>
  <match lemma="juego"><select lemma="set"/></match>
  <match lemma="de"/>
  <match lemma="etiqueta"><select lemma="tag"/></match>
</rule>
</pre>
- Francis Tyers 11:31, 1 December 2011 (UTC)
Sergio's idea
<pre>
<rule>
  <word source-lemma="pista" source-tags="n.*">
    <remove-acception target-lemma="hint" target-tags="n.*"/>
    <remove-acception target-lemma="clue" target-tags="n.*"/>
  </word>
  <word source-lemma="més" source-tags="preadv.*"/>
  <word source-lemma="llarg" source-tags="adj.*"/>
</rule>
</pre>
(10:44:10) Sergio Ortiz Rojas: my goal is to give Jacob's idea a way of expressing the bilingual aspect in the notation
(10:48:06) Sergio Ortiz Rojas: I think what Jacob and Felipe prefer is better than the original
(10:48:19) Sergio Ortiz Rojas: my only criticism is that if you leave it like this it will be too "programmatic"
(10:48:19) Sergio Ortiz Rojas: that is
(10:48:24) Sergio Ortiz Rojas: too much for people who write programs
(10:48:26) Sergio Ortiz Rojas: and not declarative enough
Old application strategy
The following is an inefficient implementation of the rule application process:
<pre>
# s ("prova" n) ("event" n) (-3 "guanyador") (-2 "de")
#
# tipus = "select";
# centre = "^prova<n>.*"
# tl_patro = ["^event<n>.*"]
# sl_patro = {-3: "^guanyador<", -2: "^de<"}

CLASS Rule:
    tipus = enum('select', 'remove')
    centre = '';
    tl_patro = [];
    sl_patro = {};

rule_table = {}; # e.g. rule_table["estació"] = [rule1, rule2, rule3];
i = 0

DEFINE ApplyRule(rule, lu):
    FOREACH target IN lu.tl:
        SWITCH rule.tipus:
            'select':
                IF target NOT IN rule.tl_patro:
                    DELETE target
            'remove':
                IF target IN rule.tl_patro:
                    DELETE target

FOREACH pair(sl, tl) IN sentence:
    FOREACH centre IN rule_table:
        IF centre IN sl:
            FOREACH rule IN rule_table[centre]:
                matched = False
                FOREACH context_item IN rule_table[centre][rule]:
                    IF context_item in sentence:
                        matched = True
                    ELSE:
                        matched = False
                # If all of the context items have matched, and none of them have not matched
                # if a rule matches break and continue to the pair.
                IF matched == True:
                    sentence[i] = ApplyRule(rule_table[centre][rule], sentence[i])
                    break
    i = i + 1
</pre>
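For comparison, here is a runnable sketch of the same strategy. It is an interpretation, not the original script: the sentence is assumed to be a list of `(source, [targets])` pairs of analysis strings, the patterns are the regular expressions from the comment at the top of the pseudocode, and a rule is taken to match only if every context item matches at its offset (as the comment in the pseudocode intends, even though its `matched` flag only reflects the last item checked). Keeping the original targets when a rule would delete all of them is an extra assumption.

```python
import re
from dataclasses import dataclass


@dataclass
class Rule:
    tipus: str       # "select" or "remove"
    centre: str      # regex on the source word, e.g. "^prova<n>.*"
    tl_patro: list   # regexes on target words, e.g. ["^event<n>.*"]
    sl_patro: dict   # offset -> regex on neighbouring source words


def context_matches(rule, sentence, i):
    """True if every context pattern matches the source word at its offset."""
    for offset, pattern in rule.sl_patro.items():
        j = i + offset
        if j < 0 or j >= len(sentence) or not re.search(pattern, sentence[j][0]):
            return False
    return True


def apply_rule(rule, targets):
    """Keep (select) or drop (remove) the targets matching the rule's patterns."""
    def hit(target):
        return any(re.search(p, target) for p in rule.tl_patro)

    if rule.tipus == "select":
        kept = [t for t in targets if hit(t)]
    else:
        kept = [t for t in targets if not hit(t)]
    return kept or targets  # assumption: never leave a word without a translation


def apply_rules(rules, sentence):
    """Apply the first matching rule to each (source, targets) pair."""
    result = []
    for i, (sl, tls) in enumerate(sentence):
        for rule in rules:
            if re.search(rule.centre, sl) and context_matches(rule, sentence, i):
                tls = apply_rule(rule, tls)
                break
        result.append((sl, tls))
    return result
```

Like the pseudocode, this tries every rule against every word, so it is linear in the number of rules per word and only meant to show the logic.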
How Brian writes rules
To come up with lexical rules, I use a wide variety of methods. Generally, I begin by thinking of common English words that can be used in several different contexts. Often, I look around the room for inspiration or write down ambiguous English words that I come across in my reading. Using this list of English words, I consult my English-Spanish dictionary (Collins Spanish Concise Dictionary, 6th Edition) and then check online to see whether the word does indeed change in Spanish depending on the context. While checking on the internet, I visit several websites to ensure accuracy. One that I've found to be particularly reliable is www.spanishdict.com, which has an extensive English-Spanish translation dictionary and covers most contexts of any given English word, along with many English idioms.
After I locate a word both online and in my English-Spanish dictionary, I compare the Spanish word each gives for the English word in a given context. If the resulting word is the same in both, I proceed to writing the rule. If not, I continue checking to try to reach a consensus. If after repeated attempts I can't find the best word, I move on and try a different word. If I am able to find a consensus word, I move on to writing the rule.
Once I've found the word I want to use in a certain context, I begin writing the rule, paying particular attention to make sure it's a lexical selection rule rather than a multiword unit. Writing the rule is likely the easiest part of the process, as there is a template to follow. After I finish writing all of the rules, I first validate the file I'm writing the rules in, and if it validates I copy the rules into the apertium-en-es.en-es.lrx file. Before committing this file to SVN, I check that it validates too.
Usage (old)
<pre>
$ cat /tmp/test | python apertium-lex-rules.py rules.txt 2>/dev/null
^El<det><def><f><sg>/The<det><def><f><sg>$ ^estació<n><f><sg>/season<n><sg>$ ^més<preadv>/more<preadv>$ ^plujós<adj><f><sg>/rainy<adj><sint><f><sg>$ ^ser<vbser><pri><p3><sg>/be<vbser><pri><p3><sg>$ ^el<det><def><f><sg>/the<det><def><f><sg>$ ^tardor<n><f><sg>/autumn<n><sg>/fall<n><sg>$^,<cm>/,<cm>$ ^i<cnjcoo>/and<cnjcoo>$ ^el<det><def><f><sg>/the<det><def><f><sg>$ ^més<preadv>/more<preadv>$ ^sec<adj><f><sg>/dry<adj><sint><f><sg>$ ^el<det><def><m><sg>/the<det><def><m><sg>$ ^estiu<n><m><sg>/summer<n><sg>$ ^.<sent>/.<sent>$
</pre>
- With rules
<pre>
$ cat /tmp/test | python apertium-lex-rules.py rules.txt | apertium-vm -c ca-en.t1x.vmb | apertium-vm -c ca-en.t2x.vmb |\
  apertium-vm -c ca-en.t3x.vmb | lt-proc -g ca-en.autogen.bin
The rainiest season is the autumn, and the driest the summer.
</pre>
- With bilingual dictionary defaults
<pre>
$ cat /tmp/test | apertium-lex-defaults ca-en.autoldx.bin | apertium-vm -c ca-en.t1x.vmb | apertium-vm -c ca-en.t2x.vmb |\
  apertium-vm -c ca-en.t3x.vmb | lt-proc -g ca-en.autogen.bin
The rainiest station is the autumn, and the driest the summer.
</pre>
Real training program
- Depend on lttoolbox + IRSTLM
- Input: biltrans output + biltrans-to-multitrans + translation
- Process:
  - Read biltrans output, make list of ambig words
  - Translate + normalise + count.
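The first step of the process above (reading biltrans output and listing the ambiguous words) might look roughly like this. It is a simplified sketch: it assumes lexical units of the form `^source/target1/target2$` without escaped slashes, and `score` is a placeholder for whatever language-model scoring is used (the plan names IRSTLM, but no IRSTLM call is shown or implied here). All function names are invented for the example.

```python
import itertools
import re

# One lexical unit between ^ and $, as in the biltrans output shown on this page.
LU = re.compile(r"\^(.*?)\$")


def parse_biltrans(line):
    """Return a list of (source, [targets]) pairs for one line of biltrans output."""
    units = []
    for body in LU.findall(line):
        source, *targets = body.split("/")
        units.append((source, targets))
    return units


def ambiguous_words(units):
    """Source words that have more than one possible translation."""
    return [source for source, targets in units if len(targets) > 1]


def all_combinations(units):
    """Every way of choosing exactly one target per lexical unit."""
    for choice in itertools.product(*(targets for _, targets in units)):
        yield list(choice)


def best_combination(units, score):
    """Pick the combination with the highest score (score is a stand-in for the LM)."""
    return max(all_combinations(units), key=score)
```

The number of combinations grows multiplicatively with each ambiguous word, so a real implementation would probably need to prune or restrict the search on long sentences.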
Idea for scanning
<pre>
<rule>
  <match lemma="foo"/>
  <any>
    <match tags="adv"/>
  </any>
  <match lemma="bar"><select lemma="баз"/></match>
</rule>
</pre>
Would make "foo adv* bar:баз"
<pre>
<rule>
  <match lemma="foo"/>
  <any>
    <or><match tags="adv"/><match tags="adj"/></or>
  </any>
  <match lemma="bar"><select lemma="баз"/></match>
</rule>
</pre>
Would make "foo (adv|adj)* bar:баз"
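The mapping from these rules to the compact patterns quoted above can be sketched as a small recursive function. This only reproduces the notation used in the two examples; the name `describe` is made up, and how an `<any>` with several children should be rendered is not specified on this page, so the sketch is only meant for an `<any>` with a single child.

```python
import xml.etree.ElementTree as ET


def describe(node):
    """Render a <rule> element in the compact 'foo adv* bar:баз' notation."""
    if node.tag == "match":
        label = node.get("lemma") or node.get("tags") or ""
        select = node.find("select")
        # A selected translation is written after a colon.
        return f"{label}:{select.get('lemma')}" if select is not None else label
    parts = [describe(child) for child in node]
    if node.tag == "or":
        return "(" + "|".join(parts) + ")"
    if node.tag == "any":
        return " ".join(parts) + "*"   # zero or more repetitions
    return " ".join(parts)             # <rule>: a plain sequence


rule = ET.fromstring(
    '<rule><match lemma="foo"/>'
    '<any><or><match tags="adv"/><match tags="adj"/></or></any>'
    '<match lemma="bar"><select lemma="баз"/></match></rule>'
)
compact = describe(rule)
```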