User talk:Skh/Application GSoC 2010

From Apertium
Revision as of 11:27, 8 April 2010 by Skh (talk | contribs)

"complex multiwords which consist of two or more inflected words which do not agree with each other (French passé composé) (gender agreement not possible in generation in 1st and 2nd person and proper nouns!)"

sorry if I'm a bit slow but what does the parenthesis mean? --unhammer 17:15, 7 April 2010 (UTC)
If I translate "I am invited", this should be "je suis invitée" if a woman is speaking, but there's no way for Apertium to know this, even if a human can deduce it from context. If il/elle is used, it is clear, but proper names might be unknown to the system, or ambiguous as well (Dominique can be a male or female name in France, for example). I've taken it out again, though, because it is beyond the scope of my proposal. -- Skh 22:24, 7 April 2010 (UTC)

Syntax

For complex (adj-noun) multiwords:

<e>
  <p>
    <l>gelbe<s n="adj"/><s n="f"/><s n="NUM"/><s n="CASE"/>
       <br />Rübe<s n="n"/><s n="f"/><s n="NUM"/><s n="CASE"/></l>
    <r>gelbe<br />Rübe<s n="np"/><s n="f"/><s n="NUM"/><s n="CASE"/></r>
  </p>
</e>

Upper case tags indicate that the words have to agree in these categories, and that whatever values these tags have need to be preserved.

nit-picking: it probably shouldn't be
<s ...>
(a literal value), but something working like
<clip part="NUM">
(given that you've defined
<def-attr n="NUM"><attr-item tags="sg"/><attr-item tags="pl"/></def-attr>
above) --unhammer 11:06, 8 April 2010 (UTC)
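
To make the agreement idea concrete, here is a minimal Python sketch of what "the words have to agree in NUM and CASE" could mean once each word has been analysed. It is only an illustration: the `ATTRS` table, the `attr_value` and `agree` helpers, and the way analyses are represented as tag lists are all assumptions made for this example, not existing Apertium code.

```python
# Hypothetical attribute categories, mirroring the def-attr idea above.
ATTRS = {
    "NUM": {"sg", "pl"},
    "CASE": {"nom", "acc", "dat", "gen"},
}


def attr_value(tags, attr):
    """Return the first tag in `tags` that belongs to category `attr`, or None."""
    for tag in tags:
        if tag in ATTRS[attr]:
            return tag
    return None


def agree(words, attrs=("NUM", "CASE")):
    """True if every word carries the same value for each listed category."""
    for attr in attrs:
        values = {attr_value(tags, attr) for _, tags in words}
        # Exactly one shared value is required, and no word may lack the category.
        if len(values) != 1 or None in values:
            return False
    return True


# Each word is (surface form, list of tags); the tag lists are made up for the example.
matching = [("gelbe", ["adj", "f", "pl", "dat"]), ("Rübe", ["n", "f", "pl", "dat"])]
mismatch = [("gelbe", ["adj", "f", "sg", "nom"]), ("Rübe", ["n", "f", "pl", "dat"])]
```

With these definitions, `agree(matching)` holds and `agree(mismatch)` does not; a clip-like mechanism would then copy the shared values to the output side rather than a literal tag.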

Implementation

Some thoughts on implementation would be good. These don't have to be (shouldn't be!) set in stone, but it would strengthen the application. E.g. could the complex (tag-grouping) multiwords be modeled as an FST? Could parts of lttoolbox or apertium-pretransfer or apertium-transfer be reused? --unhammer 11:10, 8 April 2010 (UTC)

Idea


<multiwords>
  <def-attrs>
    <def-attr n="case">
      <attr-item n="nom"/>
      <attr-item n="acc"/>
      <attr-item n="dat"/>
      <attr-item n="gen"/>
    </def-attr>
    <def-attr n="num">
      <attr-item n="sg"/>
      <attr-item n="pl"/>
    </def-attr>
  </def-attrs>
  <section id="main" type="standard">
    <mwe lm="zračna luka">
       <e>
         <p>
           <l>zračna<s n="adj"/><s n="pst"/><s n="f"/><attr n="num"/><attr n="case"/></l>
           <r>zračna</r>
         </p>
         <p>
           <l>luka<s n="n"/><s n="f"/><attr n="num"/><attr n="case"/></l>
           <r>luka<s n="n"/><s n="f"/><attr n="num"/><attr n="case"/></r>
         </p>
       </e>
    </mwe>
  </section>
</multiwords>

<attr n="case"/> could be expanded to <re><(nom|acc|gen|dat)></re> ?
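
A rough Python sketch of that expansion, under some assumptions: the `def-attrs` block is read with the standard `xml.etree.ElementTree` module, the attribute holding number values is taken to be named `num`, and the function name `attr_patterns` is invented for this example. The alternatives come out in document order, so the case pattern reads `nom|acc|dat|gen` rather than `nom|acc|gen|dat`; both match the same tags.

```python
import re
import xml.etree.ElementTree as ET

# Attribute definitions in the proposed (hypothetical) multiword format.
DEF_ATTRS = """
<def-attrs>
  <def-attr n="case">
    <attr-item n="nom"/>
    <attr-item n="acc"/>
    <attr-item n="dat"/>
    <attr-item n="gen"/>
  </def-attr>
  <def-attr n="num">
    <attr-item n="sg"/>
    <attr-item n="pl"/>
  </def-attr>
</def-attrs>
"""


def attr_patterns(def_attrs_xml):
    """Map each def-attr name to a regex matching any one of its tags."""
    root = ET.fromstring(def_attrs_xml)
    patterns = {}
    for def_attr in root.findall("def-attr"):
        items = [re.escape(item.get("n")) for item in def_attr.findall("attr-item")]
        patterns[def_attr.get("n")] = "<(" + "|".join(items) + ")>"
    return patterns


PATTERNS = attr_patterns(DEF_ATTRS)

# Pull the number and case values out of an analysis string.
analysis = "luka<n><f><sg><dat>"
num = re.search(PATTERNS["num"], analysis).group(1)
case = re.search(PATTERNS["case"], analysis).group(1)
```

Here `PATTERNS["case"]` is the string `<(nom|acc|dat|gen)>`, and `num` and `case` capture whichever values the analysis actually carries, which is what would need to be preserved on the right-hand side.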