Hunmorph
hunmorph is a set of programs, written largely in OCaml, for making morphological analysers and generators. Analysers and generators made with these tools could be integrated into an Apertium-based machine translation system, although they would need to be hacked to change the input/output format. Currently it only seems to support Hungarian.
Requirements
On Debian you will need:
- ocaml
- ocaml-libs
- ocaml-tools
- ocaml-compiler-libs
- ocaml-nox
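Before building, it may help to confirm that the command-line tools used on this page are actually on your PATH. The following is a minimal sketch; the helper name `missing_tools` is made up for illustration, and the assumption that the build needs `ocamlc`, `make` and `cvs` is inferred from the commands on this page rather than from any hunmorph documentation.

```shell
# Print each named tool that is not found on PATH, one per line.
# Prints nothing if every tool is available.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" > /dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Usage (tool names are an assumption based on the commands on this page):
#   missing_tools ocamlc make cvs
```

If the function prints nothing, all the listed tools were found.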
Compiling
cvs -d :pserver:anonymous:anonymous@cvs.mokk.bme.hu:/local/cvs co ocamorph
cd ocamorph
./build.sh build
cd src/lib
make
cd ../bindings/c
make
cd ../../wrappers/ocamorph
make
If you get the error /usr/bin/ld: cannot find -lunix, check the Makefile and the include (-I) paths; they probably don't point to the right place. On Debian I had to change /usr/lib/ocaml/3.09.1 to /usr/lib/ocaml/3.10.1. After you've compiled this you should have an ocamorph binary. Now go back to the root of your CVS tree.
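The path fix described above can be scripted. This is only a sketch: the helper name `retarget_ocaml_libdir` is hypothetical, it assumes the paths contain no characters special to sed other than dots (no `|`, `&` or backslashes), and it assumes the stale path appears literally in the Makefile. `ocamlc -where` prints the standard library directory of the installed compiler, which is a reasonable guess for the new value.

```shell
# Replace every literal occurrence of OLD with NEW in FILE,
# keeping the original as FILE.bak.
retarget_ocaml_libdir() {
    file=$1 old=$2 new=$3
    # Escape dots so that "3.09.1" is matched literally, not as a regex.
    old_re=$(printf '%s\n' "$old" | sed 's/\./\\./g')
    cp "$file" "$file.bak" &&
    sed "s|$old_re|$new|g" "$file.bak" > "$file"
}

# Usage (run in the directory whose Makefile has the stale path):
#   retarget_ocaml_libdir Makefile /usr/lib/ocaml/3.09.1 "$(ocamlc -where)"
```

Check the result with diff Makefile.bak Makefile before re-running make.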
You can test ocamorph with the binary distribution available here. If you untar the file in ~/source/ you should see:
$ ls ~/source/morphdb.hu/
AUTHORS  CVS  doc  LICENCE  morphdb_hu.aff  morphdb_hu.dic  README
You can then test it with:
$ echo "programot" | ocamorph --aff ~/source/morphdb.hu/morphdb_hu.aff --dic ~/source/morphdb.hu/morphdb_hu.dic
> programot
program/NOUN<CAS<ACC>>
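Since the output format would need converting before use elsewhere, a small filter can turn it into something easier to process. The sketch below assumes the output always looks like the sample above: a line starting with "> " that echoes the input word, followed by one analysis per line. That layout is inferred from the single example on this page, and the helper name `ocamorph_to_tsv` is made up.

```shell
# Read ocamorph output on stdin and print "word<TAB>analysis" lines.
ocamorph_to_tsv() {
    awk '
        /^> / { word = substr($0, 3); next }  # "> word" echoes the input token
        NF    { print word "\t" $0 }          # any other non-empty line is an analysis
    '
}

# Usage:
#   echo "programot" | ocamorph --aff morphdb_hu.aff --dic morphdb_hu.dic | ocamorph_to_tsv
```

For the example above this gives a single line with programot and program/NOUN<CAS<ACC>> separated by a tab.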
Compiling the lexicon
First check out morphdb.hu and hunlex.
$ cvs -d :pserver:anonymous:anonymous@cvs.mokk.bme.hu:/local/cvs co lexicons/morphdb.hu
$ cvs -d :pserver:anonymous:anonymous@cvs.mokk.bme.hu:/local/cvs co hunlex
Enter the hunlex/ directory and run make. If there are errors, ignore them, provided that an executable named hunlex is created under the src/ sub-directory.
Next enter the lexicons/morphdb.hu directory, edit the Makefile and change the paths of HUNLEXMAKEFILE and HUNLEX to point to the directory where you just built hunlex. Uncomment them if necessary; the result will probably look something like this:
HUNLEXMAKEFILE=../../hunlex/adm/HunlexMakefile
HUNLEX=../../hunlex/src/hunlex
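If you prefer not to edit the Makefile by hand, the two assignments can be set with sed. This is a sketch under assumptions: the helper name `set_make_var` is hypothetical, the variables are assumed to be defined with a plain `NAME=value` line (possibly commented out with `#`), and the values are assumed to contain no `|`, `&` or backslash characters. The actual layout of the morphdb.hu Makefile may differ.

```shell
# Set NAME=VALUE in FILE, uncommenting the line if it starts with '#'.
# The original file is kept as FILE.bak.
set_make_var() {
    file=$1 name=$2 value=$3
    cp "$file" "$file.bak" &&
    sed "s|^#* *$name *=.*|$name=$value|" "$file.bak" > "$file"
}

# Usage (from lexicons/morphdb.hu):
#   set_make_var Makefile HUNLEXMAKEFILE ../../hunlex/adm/HunlexMakefile
#   set_make_var Makefile HUNLEX ../../hunlex/src/hunlex
```

The pattern requires `=` right after the name, so setting HUNLEX does not touch the HUNLEXMAKEFILE line.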
Now run make. Note that compiling the lexicon can take a long time and use a lot of CPU, so consider running it on a machine other than your desktop. The resulting .aff and .dic files will be found in the out/ directory.
Performance
For a 10,000-line test file, with an analyser supporting 4,000,000 word forms:
$ time cat /tmp/test | ocamorph --aff ~/source/morphdb.hu/morphdb_hu.aff --dic ~/source/morphdb.hu/morphdb_hu.dic > /dev/null
real    0m47.224s
user    0m41.859s
sys     0m0.620s
Compile the lexicon into a binary file using:
$ echo "programot" | ocamorph --aff ~/source/morphdb.hu/morphdb_hu.aff --dic ~/source/morphdb.hu/morphdb_hu.dic --bin hu.morph.bin
You seem to be required to attempt to analyse something in order to compile. Then re-test:
$ time cat /tmp/test | ocamorph --bin hu.morph.bin > /dev/null
real    0m15.023s
user    0m14.625s
sys     0m0.344s
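The two wall-clock timings above can be compared with a little arithmetic. The sketch below uses awk; the helper names `throughput` and `speedup` are made up for illustration.

```shell
# Lines processed per second: $1 = number of lines, $2 = wall-clock seconds.
throughput() {
    awk -v n="$1" -v t="$2" 'BEGIN { printf "%.0f\n", n / t }'
}

# Ratio of two run times: $1 = slower time, $2 = faster time (seconds).
speedup() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

# Usage with the "real" times reported on this page:
#   throughput 10000 47.224   # .aff/.dic run
#   throughput 10000 15.023   # compiled binary run
#   speedup 47.224 15.023
```

With these figures the compiled binary is about 3.14 times faster: roughly 666 lines per second against 212.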
The final size of the compiled binary is 22 MB.
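Because the compile step seems to need some input to analyse, it can be convenient to wrap it so that the binary is only built when it is missing. This is a sketch only: the helper name `compile_morph_bin` is hypothetical, the flags are the ones used on this page, and feeding the single word "programot" is simply a reuse of the earlier test word.

```shell
# Build BIN from AFF and DIC unless BIN already exists.
# Usage: compile_morph_bin AFF DIC BIN
compile_morph_bin() {
    aff=$1 dic=$2 bin=$3
    # Skip the slow compile if the binary is already there.
    [ -f "$bin" ] && return 0
    # ocamorph appears to need something to analyse before it writes the binary.
    echo "programot" | ocamorph --aff "$aff" --dic "$dic" --bin "$bin" > /dev/null
}

# Usage:
#   compile_morph_bin ~/source/morphdb.hu/morphdb_hu.aff \
#                     ~/source/morphdb.hu/morphdb_hu.dic hu.morph.bin
```

Later runs can then use ocamorph --bin hu.morph.bin directly.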
Further reading
- Trón, V., Németh, L., Halácsy, P., Kornai, A., Gyepesi, G., and Varga, D. (2005) "Hunmorph: open source word analysis". Proceedings of the ACL 2005 Workshop on Software, pp. 77–85.