Hunmorph

From Apertium
Latest revision as of 15:12, 8 July 2012

hunmorph is a set of programs, written largely in OCaml, for making morphological analysers and generators. Analysers and generators made with these tools could be integrated into an Apertium-based machine translation system, although they would need to be modified to change the input and output format. Currently it only seems to support Hungarian.

Requirements

On Debian you will need:

  • ocaml
  • ocaml-libs
  • ocaml-tools
  • ocaml-compiler-libs
  • ocaml-nox

Get them all by issuing:

sudo apt-get install ocaml ocaml-libs ocaml-tools ocaml-compiler-libs ocaml-nox

Compiling

Check out the code from CVS, then compile the OCaml code, the C bindings and the C wrapper around the morphological analyser:

cvs -d :pserver:anonymous:anonymous@cvs.mokk.bme.hu:/local/cvs co ocamorph
cd ocamorph
./build.sh build
cd src/lib
make
cd ../bindings/c
make
cd ../../wrappers/ocamorph
make

If you get the error /usr/bin/ld: cannot find -lunix, check the include (-I) and library (-L) paths in the Makefile; they probably don't point to the right place. On Debian, I had to change /usr/lib/ocaml/3.09.1 to /usr/lib/ocaml/3.10.1 and /usr/local/lib to /usr/lib. Once compilation finishes, you should have an ocamorph binary in wrappers/ocamorph. Now go back to the root of your CVS tree.
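One way to check the -I and -L paths mentioned above is to scan the Makefile text and report directories that do not exist. The following is a minimal sketch and not part of hunmorph: the function name find_missing_paths and the sample Makefile line are made up for illustration, and paths containing Makefile variables are skipped rather than expanded.

```python
import os
import re

# Matches -I<dir> / -L<dir> (optionally separated by spaces or tabs) at the
# start of a line or after whitespace or '='.
FLAG_RE = re.compile(r"(?:^|[\s=])-([IL])[ \t]*(\S+)", re.MULTILINE)


def find_missing_paths(makefile_text, exists=os.path.isdir):
    """Return (flag, path) pairs whose directory does not exist."""
    missing = []
    for flag, path in FLAG_RE.findall(makefile_text):
        if "$" in path:
            continue  # Makefile variable: cannot be resolved here
        if not exists(path):
            missing.append((flag, path))
    return missing


# Hypothetical Makefile line, modelled on the paths quoted above.
sample = "CFLAGS=-I/usr/lib/ocaml/3.09.1 -L/usr/local/lib -lunix\n"
for flag, path in find_missing_paths(sample):
    print("-%s path not found: %s" % (flag, path))
```

On a system with OCaml installed, ocamlc -where prints the standard library directory, which may help when choosing the replacement path.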

You can test ocamorph with the binary distribution available at ftp://ftp.mokk.bme.hu/Tool/Hunmorph/Resources/Morphdb.hu/morphdb-hu-20060525.tgz. If you untar the file in ~/source/ you should see:

$ ls ~/source/morphdb.hu/
AUTHORS  CVS  doc  LICENCE  morphdb_hu.aff  morphdb_hu.dic  README

You can then test it with:

$ echo "programot" | ocamorph --aff ~/source/morphdb.hu/morphdb_hu.aff --dic ~/source/morphdb.hu/morphdb_hu.dic
> programot
program/NOUN<CAS<ACC>>
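
The output shown above pairs each input word (on a line starting with >) with one analysis per line in lemma/TAGS form. Assuming that format holds in general, a minimal sketch of reading it and rendering an Apertium-style lexical unit might look like the following. The function names and the TAG_MAP table are hypothetical, and only the two tags from the example are mapped.

```python
import re

# Hypothetical tag mapping, for illustration only; a real mapping would
# have to cover the full morphdb.hu tag set.
TAG_MAP = {"NOUN": "n", "ACC": "acc"}


def parse_ocamorph(output):
    """Group analysis lines under the '> word' line that precedes them."""
    analyses = {}
    current = None
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("> "):
            current = line[2:]
            analyses.setdefault(current, [])
        elif current is not None:
            analyses[current].append(line)
    return analyses


def to_apertium(surface, analysis):
    """Render one 'lemma/TAGS' analysis as an Apertium-style lexical unit."""
    lemma, _, tags = analysis.partition("/")
    names = re.findall(r"[A-Z]+", tags)
    # Tags without an entry in TAG_MAP (here: CAS) are silently dropped.
    mapped = "".join("<%s>" % TAG_MAP[n] for n in names if n in TAG_MAP)
    return "^%s/%s%s$" % (surface, lemma, mapped)


sample = "> programot\nprogram/NOUN<CAS<ACC>>\n"
for word, readings in parse_ocamorph(sample).items():
    for reading in readings:
        print(to_apertium(word, reading))  # ^programot/program<n><acc>$
```

Dropping unmapped tags keeps the sketch short; a real converter would more likely report them so that gaps in the mapping are noticed.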

Compiling the lexicon

First, check out morphdb.hu and hunlex:

cvs -d :pserver:anonymous:anonymous@cvs.mokk.bme.hu:/local/cvs co lexicons/morphdb.hu 
cvs -d :pserver:anonymous:anonymous@cvs.mokk.bme.hu:/local/cvs co hunlex

Enter the hunlex/ directory and run make. Any errors can be ignored, provided that an executable hunlex is created under the src/ sub-directory.

Next, enter the lexicons/morphdb.hu directory, edit the Makefile, and change HUNLEXMAKEFILE and HUNLEX to point to the hunlex directory you just built. Uncomment them if necessary. The result will probably look something like this:

HUNLEXMAKEFILE=../../hunlex/adm/HunlexMakefile
HUNLEX=../../hunlex/src/hunlex

Now issue the make command. Note that compiling the lexicon can take a long time and a lot of CPU cycles, so consider running it on a machine other than your desktop. The resulting .aff and .dic files will be found in the out/ directory.

Performance

For a 10,000-line test file, with an analyser that supports 4,000,000 word forms:

$ time cat /tmp/test | ocamorph --aff ~/source/morphdb.hu/morphdb_hu.aff --dic ~/source/morphdb.hu/morphdb_hu.dic > /dev/null
real    0m47.224s
user    0m41.859s
sys     0m0.620s

Compile the lexicon into a binary using:

$ echo "programot" | ocamorph --aff ~/source/morphdb.hu/morphdb_hu.aff --dic ~/source/morphdb.hu/morphdb_hu.dic --bin hu.morph.bin

It seems that you have to analyse something in order to trigger the compilation. Then re-test:

$ time cat /tmp/test | ocamorph  --bin hu.morph.bin > /dev/null
real    0m15.023s
user    0m14.625s
sys     0m0.344s

The final size of the compiled binary is 22 MB.
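
To put the two timings above side by side, the short sketch below converts the real times into seconds and derives throughput and speedup. It is only arithmetic on the figures quoted in this section; the helper name parse_time is made up for illustration.

```python
import re


def parse_time(value):
    """Convert a `time` field such as '0m47.224s' into seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", value.strip())
    if match is None:
        raise ValueError("unrecognised time value: %r" % value)
    return int(match.group(1)) * 60 + float(match.group(2))


LINES = 10000  # size of the test file quoted above
text_lexicon = parse_time("0m47.224s")    # real time with --aff/--dic
binary_lexicon = parse_time("0m15.023s")  # real time with --bin

speedup = text_lexicon / binary_lexicon
print("text lexicon:   %.0f lines/s" % (LINES / text_lexicon))
print("binary lexicon: %.0f lines/s" % (LINES / binary_lexicon))
print("speedup:        %.1fx" % speedup)
```

On these figures, loading the precompiled binary is roughly three times faster than reading the .aff and .dic files on each run.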

Further reading

  • Trón, V., Németh, L., Halácsy, P., Kornai, A., Gyepesi, G., and Varga, D. (2005) "Hunmorph: open source word analysis". Proceedings of the ACL 2005 Workshop on Software. pp. 77-85. http://aclweb.org/anthology-new/W/W05/W05-1106.pdf

External links

  • Hunmorph - open source morphological analyzer: http://mokk.bme.hu/resources/hunmorph