Extract

There are two versions of '''extract'''. The first supports Unicode (although not in paradigm names); the second does not support Unicode, but supports a system of constraints. For Apertium use, the first version is recommended; any constraints can instead be applied using [[constraint grammar]].

'''Note:''' on Debian/Ubuntu, you will need to install the <code>libghc6-regex-compat-dev</code> package.

==Example paradigm==
;Apertium
<pre>
<pardef n="wol/f__n">
  <e>
    <p>
      <l>f</l>
      <r>f<s n="n"/><s n="sg"/></r>
    </p>
  </e>
  <e>
    <p>
      <l>ves</l>
      <r>f<s n="n"/><s n="pl"/></r>
    </p>
  </e>
</pardef>
</pre>
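
To make the structure of the Apertium entry concrete, here is a minimal Python sketch that reads a <code>pardef</code> like the one above and lists, for each entry, the surface ending, the lemma ending and the tags. It assumes well-formed XML in which every <code>r</code> element is closed. The function name <code>pardef_entries</code> is invented for illustration and is not part of Apertium or apertium2extract.

```python
import xml.etree.ElementTree as ET

# Well-formed copy of the example pardef (assumption: every <r> is closed).
PARDEF = """
<pardef n="wol/f__n">
  <e><p><l>f</l><r>f<s n="n"/><s n="sg"/></r></p></e>
  <e><p><l>ves</l><r>f<s n="n"/><s n="pl"/></r></p></e>
</pardef>
"""


def pardef_entries(xml_text):
    """Return (paradigm name, [(surface ending, lemma ending, tags), ...])."""
    root = ET.fromstring(xml_text)
    entries = []
    for pair in root.iter("p"):
        left = pair.find("l")
        right = pair.find("r")
        surface = left.text or ""          # text of <l>, e.g. "ves"
        lemma_ending = right.text or ""    # text of <r> before the first <s/>
        tags = [s.get("n") for s in right.findall("s")]
        entries.append((surface, lemma_ending, tags))
    return root.get("n"), entries


name, entries = pardef_entries(PARDEF)
print(name)
for surface, lemma_ending, tags in entries:
    print(surface, "->", lemma_ending, tags)
```

Each tuple pairs what is appended to the stem on the surface side with what is appended on the lexical side, which is the information an Extract paradigm has to reproduce.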
 
   
 
;Extract
The lemma is formed by adding "f" to the stem and the plural by adding "ves". Adding "fing" to the stem is forbidden, so a stem whose "fing" form is attested does not match.
<pre>
paradigm wol_f__n =
        x+"f"
        { x+"ves" & ~(x+"fing")} ;
</pre>
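
The rule can be illustrated with a short Python sketch. This is not how extract is implemented; it only mimics the constraint in the braces, under the assumption that a stem <code>x</code> is accepted when <code>x+"ves"</code> occurs in the word list and <code>x+"fing"</code> does not, and that the lemma <code>x+"f"</code> is then reported. The function name and the sample word list are invented for the example.

```python
def match_wol_f_n(word_forms):
    """Return lemmas x+"f" whose stem x satisfies x+"ves" & ~(x+"fing")."""
    forms = set(word_forms)
    lemmas = set()
    for form in forms:
        if not form.endswith("ves"):
            continue
        stem = form[:-len("ves")]
        # Assumption: an empty stem is never a valid match.
        if stem and stem + "fing" not in forms:
            lemmas.add(stem + "f")
    return sorted(lemmas)


# Hypothetical word list: "hooves" is rejected because "hoofing" is present.
corpus = ["wolf", "wolves", "elves", "hooves", "hoofing"]
print(match_wol_f_n(corpus))
```

The exclusion is what keeps verb-like stems out of the noun paradigm: one attested "fing" form is enough to discard the candidate.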

== Troubleshooting ==
If you are using '''apertium2extract''' to convert Apertium paradigms to Extract paradigms on Ubuntu 9.04, you might encounter the following error the first time you run the script:
<pre>
Traceback (most recent call last):
  File "apertium2extract.py", line 7, in <module>
    from xml import xpath;
  File "/usr/lib/python2.6/dist-packages/_xmlplus/xpath/__init__.py", line 105, in <module>
    import Context
  File "/usr/lib/python2.6/dist-packages/_xmlplus/xpath/Context.py", line 15, in <module>
    import CoreFunctions
  File "/usr/lib/python2.6/dist-packages/_xmlplus/xpath/CoreFunctions.py", line 20, in <module>
    from xml.xpath import Util, Conversions
  File "/usr/lib/python2.6/dist-packages/_xmlplus/xpath/Conversions.py", line 22, in <module>
    from xml.utils import boolean
ImportError: cannot import name boolean
</pre>

This happens because the python-xml package is broken in Ubuntu 9.04 (see [https://bugs.launchpad.net/ubuntu/+source/python-xml/+bug/343242 Launchpad bug 343242]).

As a quick and dirty fix, back up the broken <code>utils</code> directory and replace it with the copy from <code>oldxml</code>:
<pre>
sudo mv /usr/lib/python2.6/dist-packages/_xmlplus/utils /usr/lib/python2.6/dist-packages/_xmlplus/utils.backup
sudo cp -r /usr/lib/python2.6/dist-packages/oldxml/_xmlplus/utils/ /usr/lib/python2.6/dist-packages/_xmlplus/utils
</pre>
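
To check whether the fix took effect without running the whole script, a small test of the failing import may help. The sketch below is only a suggestion: the helper name <code>can_import</code> is made up, and it has to be run with the same interpreter that runs apertium2extract for the result to be meaningful.

```python
def can_import(module_name, attr):
    """Roughly check whether `from module_name import attr` would work."""
    try:
        module = __import__(module_name, fromlist=[attr])
    except ImportError:
        return False
    return hasattr(module, attr)


# The import that fails in the traceback; names are taken from the error message.
print(can_import("xml.utils", "boolean"))
```

If this prints <code>False</code> after the directories have been swapped, the <code>utils</code> copy is probably not the one the interpreter is picking up.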
   
 
==External links==
* [http://xixona.dlsi.ua.es/~fran/apertium2extract.py apertium2extract]: a script for converting Apertium paradigms to Extract paradigms.
* [http://www.cs.chalmers.se/~markus/extract Lexicon Extraction Tool]
   

Latest revision as of 12:51, 17 August 2009

The extract tool is a program for matching word forms (for example, from a corpus) to lemmata and paradigms. Extract paradigms differ from Apertium paradigms in that they can contain both "inclusions" and "exclusions" for matching purposes. For example, to match English nouns but not verbs, you might write an extract paradigm that requires "root + s" but forbids "root + ing".
