Matching unknown words

From Apertium

From time to time, the question comes up of how to match unknown words in transfer. In interchunk this is easy, because each unknown word has the chunk lemma 'unknown', but how to do it in apertium-transfer is poorly documented.

The answer is to use a cat-item with an empty tags attribute:

     <cat-item tags=""/>

An example:

<?xml version="1.0"?>
<transfer default="chunk">
 <section-def-cats>
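   <!-- Analysed words: lexical units that carry at least one tag -->
   <def-cat n="any">
     <cat-item tags="*"/>
   </def-cat>

   <!-- Unknown words: they carry no tags, so an empty tags attribute matches them -->
   <def-cat n="unk">
     <cat-item tags=""/>
   </def-cat>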

 </section-def-cats>

 <section-def-attrs>
 </section-def-attrs>

 <section-def-vars>
 </section-def-vars>

 <section-rules>
   <rule>
     <pattern>
       <pattern-item n="any"/>
     </pattern>
     <action>
       <out>
         <chunk name="any">
           <tags>
             <tag><clip pos="1" side="tl" part="tags"/></tag>
           </tags>
           <lu><clip pos="1" side="tl" part="whole"/></lu>
         </chunk>
       </out>
     </action>
   </rule>
   <rule>
     <pattern>
       <pattern-item n="unk"/>
     </pattern>
     <action>
       <out>
         <chunk name="unk">
           <tags>
             <tag><clip pos="1" side="tl" part="tags"/></tag>
           </tags>
           <lu><clip pos="1" side="tl" part="whole"/></lu>
         </chunk>
       </out>
     </action>
   </rule>
 </section-rules>
</transfer>

Note that the <tags> element must be present in the chunk, even for unknown words that have no tags; otherwise, the opening brace of the chunk is omitted from the output.

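Usage, assuming the rules above are saved as unk.t1x and compiled to unk.bin (for example with apertium-preprocess-transfer):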

echo '^foo<n><sg>$ ^*bar$' | apertium-transfer -n unk.t1x unk.bin
^any<n><sg>{^foo<n><sg>$}$ ^unk{^*bar$}$
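The empty cat-item can also sit alongside other cat-items in the same category, so that unknown words are handled by a rule written for tagged words. The fragment below is a minimal sketch rather than code from an existing language pair: the category name nom_or_unk and the noun tag n are assumptions made for illustration, and whether treating unknown words like nouns is sensible depends on the language pair.

```xml
<!-- Hypothetical category: nouns plus unknown words -->
<def-cat n="nom_or_unk">
  <!-- Nouns: the n tag followed by any further tags -->
  <cat-item tags="n.*"/>
  <!-- Unknown words: no tags at all -->
  <cat-item tags=""/>
</def-cat>

<!-- Used in a rule pattern like any other category -->
<pattern>
  <pattern-item n="nom_or_unk"/>
</pattern>
```

In a real rule file the def-cat goes inside section-def-cats and the pattern inside a rule, as in the full example above.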