Difference between revisions of "User talk:Skh/Application GSoC 2010"
Latest revision as of 07:19, 27 April 2010
"complex multiwords which consist of two or more inflected words which do not agree with each other (french passé composé) (gender agreement not possible in generation in 1st and 2nd person and proper nouns!) "
- sorry if I'm a bit slow but what does the parenthesis mean? --unhammer 17:15, 7 April 2010 (UTC)
- If I translate "I am invited", this should be "je suis invitée" if a woman is speaking, but there's no way for apertium to know this, even if a human can deduce it from context. If il/elle is used, it is clear, but proper names might be unknown to the system, or ambiguous as well (Dominique can be a male or female name in France, for example). I've taken it out again, though, because it is beyond the scope of my proposal. -- Skh 22:24, 7 April 2010 (UTC)
Syntax
For complex (adj-noun) multiwords:
<e>
  <p>
    <l>gelbe<s n="adj"><s n="f"><s n="NUM"><s n="CASE"> <br />Rübe<s n="n"><s n="f"><s n="NUM"><s n="CASE"></l>
    <r>gelbe<br />Rübe<s n="np"><s n="f"><s n="NUM"><s n="CASE"></r>
  </p>
</e>
Upper case tags indicate that the words have to agree in these categories, and that whatever values these tags have need to be preserved.
- nit-picking: it probably shouldn't be <s ...> (a literal value), but something working like <clip part="NUM"> (given that you've defined <def-attr n="NUM"><attr-item tags="sg"/><attr-item tags="pl"/></def-attr> above) --unhammer 11:06, 8 April 2010 (UTC)
Implementation
Some thoughts on implementation would be good. These don't have to be (shouldn't be!) set in stone, but it would strengthen the application. E.g. could the complex (tag-grouping) multiwords be modeled as an FST? Could parts of lttoolbox or apertium-pretransfer or apertium-transfer be reused? --unhammer 11:10, 8 April 2010 (UTC)
btw, GNU sed should be strong enough to handle zračna# luka<n><f><sg><nom> at least ;-)
$ echo 'zračna<adj><f><sg><nom> luka<n><f><sg><nom>'|gsed 's/zračna<adj><f><\(sg\|pl\)><\(nom\|acc\|gen\|dat\)> luka<n><f><\1><\2>/zračna# luka<n><f><\1><\2>/g'
zračna# luka<n><f><sg><nom>
- Yes, but that's the point: sed "regular expressions" are more powerful than formal-language-theory regular expressions. My train of thought was to find out whether an FST can handle the problem at all, theoretically, in order not to waste time trying the impossible. See also http://en.wikipedia.org/wiki/Regular_expression#Patterns_for_non-regular_languages -- Skh 19:03, 8 April 2010 (UTC)
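One way to look at the FST question: the back-references in the sed command above only ever range over a finite set of tag values, so they can in principle be replaced by an explicit list of every agreeing combination, which leaves an ordinary regular expression with no back-references. The sketch below is only an illustration of that argument, not part of the proposal; the names NUMBERS, CASES, expand_agreement_pattern and agrees are made up for the example, and the tag lists are taken from the sed command.

```python
import itertools
import re

# Assumed closed tag sets, copied from the sed example on this page.
NUMBERS = ["sg", "pl"]
CASES = ["nom", "acc", "gen", "dat"]


def expand_agreement_pattern() -> str:
    """Build a back-reference-free regex by listing every agreeing (number, case) pair."""
    alternatives = []
    for num, case in itertools.product(NUMBERS, CASES):
        agreement = f"<{num}><{case}>"
        literal = f"zračna<adj><f>{agreement} luka<n><f>{agreement}"
        alternatives.append(re.escape(literal))
    # A plain union of literal strings: a regular language in the formal sense.
    return "|".join(alternatives)


PATTERN = re.compile(expand_agreement_pattern())


def agrees(text: str) -> bool:
    """Return True if text is one of the enumerated agreeing adjective-noun pairs."""
    return PATTERN.fullmatch(text) is not None
```

The cost is size: the expanded pattern grows with the product of the attribute value counts (2 x 4 = 8 alternatives here), which is the usual trade-off when agreement is compiled out into a finite-state device.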
Idea
<multiwords>
  <def-attrs>
    <def-attr n="case">
      <attr-item n="nom"/>
      <attr-item n="acc"/>
      <attr-item n="dat"/>
      <attr-item n="gen"/>
    </def-attr>
    <def-attr n="num">
      <attr-item n="sg"/>
      <attr-item n="pl"/>
    </def-attr>
  </def-attrs>
  <section id="main" type="standard">
    <mwe lm="zračna luka">
      <e>
        <p>
          <l>zračna<s n="adj"/><s n="pst"/><s n="f"/><attr n="num"/><attr n="case"/></l>
          <r>zračna</r>
        </p>
        <head>
          <p>
            <l>luka<s n="n"/><s n="f"/><attr n="num"/><attr n="case"/></l>
            <r>luka<s n="n"/><s n="f"/><attr n="num"/><attr n="case"/></r>
          </p>
        </head>
      </e>
    </mwe>
  </section>
</multiwords>

<attr n="case"/> could be expanded to <re><(nom|acc|gen|dat)></re> ?
-- One problem is that regex can't be used for generation.

Analysis:
  input:  ^zračna<adj><pst><f><sg><nom>$ ^luka<n><f><sg><nom>$
  output: ^zračna luka<n><f><sg><nom>$

Generation:
  input:  ^zračna luka<n><f><sg><nom>$
  output: ^zračna<adj><pst><f><sg><nom>$ ^luka<n><f><sg><nom>$
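To make the analysis/generation point concrete: if the entry is stored as a list of (separate, merged) string pairs rather than as a regex, the same table can be read in either direction. The following is a minimal sketch under that assumption, hard-coding the zračna luka entry; build_pairs, analyse and generate are hypothetical names, and plain string replacement stands in for whatever lttoolbox or a new module would really do.

```python
from itertools import product

# Assumed attribute values, matching the <def-attrs> block above.
NUMBERS = ["sg", "pl"]
CASES = ["nom", "acc", "dat", "gen"]


def build_pairs():
    """List every (separate units, merged unit) pair for the zračna luka entry."""
    pairs = []
    for num, case in product(NUMBERS, CASES):
        agreement = f"<{num}><{case}>"
        separate = f"^zračna<adj><pst><f>{agreement}$ ^luka<n><f>{agreement}$"
        merged = f"^zračna luka<n><f>{agreement}$"
        pairs.append((separate, merged))
    return pairs


PAIRS = build_pairs()


def analyse(stream: str) -> str:
    """Analysis direction: join the two agreeing lexical units into one multiword unit."""
    for separate, merged in PAIRS:
        stream = stream.replace(separate, merged)
    return stream


def generate(stream: str) -> str:
    """Generation direction: split the multiword unit back into its two lexical units."""
    for separate, merged in PAIRS:
        stream = stream.replace(merged, separate)
    return stream
```

Because each pair is a fixed string on both sides, the generation direction knows exactly which adjective tags to restore, which is the information a one-way regex match would not carry.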
Comments from Mikel
Fran:

It is a nice proposal. Pass on my comments if you find them adequate.

I would feel better if whatever the format selected to store these complex multiwords were converted to an Apertium-compatible format, even if it meant expanding them in some way, before trying to touch the compiler or something. Something like a metadix transformation (sorry to name the devil here). This would be a conservative way of dealing with the new problem that would make a nice first deliverable.

Instead of a special new section <multiwords...> I would have a section at the beginning in which the variables or placeholders are defined and then a special attribute of entries <e> (or none). Placeholders should have a different name, but perhaps not clip. I would avoid using same names as in transfer for things that may be substantially different.

Also, I think that focussing on contiguous multiwords would be better than going for two kinds of multiwords. I would leave discontiguous multiwords for a later stage.

I don't believe some statements such as "As it is now, some of the multiword constructs can only be implemented with workarounds in the dictionary, and some, like separable verbs, not at all.". We do have tricks in place!

Hope this helped

Mikel
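- True, you can even use apertium-transfer to do buffering/token-splitting for discontiguous multiwords (http://apertium.svn.sourceforge.net/viewvc/apertium/trunk/apertium-nn-nb/dev/apertium-nn-nb.multiwords.t1x)... --unhammer 17:50, 8 April 2010 (UTC)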
- Thanks for the comments, and for passing them on. -- Skh 18:58, 8 April 2010 (UTC)
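As a rough illustration of the "expand first, touch the compiler later" route Mikel describes, a preprocessing step could replace each <attr n="..."/> placeholder with every concrete <s n="..."/> value before the dictionary is compiled. The sketch below assumes the placeholder syntax from the Idea section and works on a single entry string; DEF_ATTRS, TEMPLATE and expand_entry are invented for the example, and the exact output format a real metadix-style transformation would need is not settled here.

```python
import itertools
import re

# Assumed placeholder definitions, mirroring the <def-attrs> block in the Idea section.
DEF_ATTRS = {
    "num": ["sg", "pl"],
    "case": ["nom", "acc", "dat", "gen"],
}

# One entry in the sketched syntax, still containing placeholders.
TEMPLATE = (
    '<e><p>'
    '<l>luka<s n="n"/><s n="f"/><attr n="num"/><attr n="case"/></l>'
    '<r>luka<s n="n"/><s n="f"/><attr n="num"/><attr n="case"/></r>'
    '</p></e>'
)

ATTR_RE = re.compile(r'<attr n="([^"]+)"/>')


def expand_entry(template: str, def_attrs: dict) -> list:
    """Return one placeholder-free entry per combination of attribute values."""
    # Placeholder names in order of first appearance, without duplicates.
    names = list(dict.fromkeys(ATTR_RE.findall(template)))
    entries = []
    for values in itertools.product(*(def_attrs[name] for name in names)):
        binding = dict(zip(names, values))
        # The same placeholder gets the same value everywhere, which enforces agreement.
        entry = ATTR_RE.sub(lambda m: f'<s n="{binding[m.group(1)]}"/>', template)
        entries.append(entry)
    return entries
```

Binding each placeholder once per combination keeps the left and right sides in agreement, at the cost of one expanded entry per combination of values.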