hunmorph is a set of programs for making morphological analysers and generators, written largely in OCaml. Analysers and generators made with these tools could be integrated into an Apertium-based machine translation system, although they would need to be hacked to change the input/output format. Currently it only seems to support Hungarian.
On Debian you will need several OCaml packages. Get them all by issuing:
sudo apt-get install ocaml ocaml-libs ocaml-tools ocaml-compiler-libs ocaml-nox
Check out the code from CVS, then compile the OCaml code, the C bindings and the C wrapper around the morphological analyser:
cvs -d :pserver:anonymous:email@example.com:/local/cvs co ocamorph
cd ocamorph
./build.sh build
cd src/lib
make
cd ../bindings/c
make
cd ../../wrappers/ocamorph
make
If you get the error /usr/bin/ld: cannot find -lunix, check the include (-I) and library (-L) paths in the Makefile; they probably don't point to the right place. On Debian, I had to change these paths to point at /usr/lib/ocaml/3.10.1 and /usr/lib. After you've compiled this you should have an ocamorph binary in wrappers/ocamorph. Now go back to the root of your CVS tree.
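When chasing the -lunix error, it can help to see which directories actually contain the library before editing the Makefile. The following is a minimal sketch, not part of hunmorph: find_lib_dirs is a hypothetical helper name, and the directories passed to it are only guesses at where a distribution might put OCaml's libraries.

```shell
# Hypothetical helper (not part of hunmorph): print every directory under the
# given roots that contains lib<name>.a or lib<name>.so, i.e. a candidate
# value for the -L path in the Makefile.
find_lib_dirs() {
  lib="$1"
  shift
  find "$@" \( -name "lib${lib}.a" -o -name "lib${lib}.so" \) \
    -exec dirname {} \; 2>/dev/null | sort -u
}

# Example call; the search roots are assumptions, adjust them for your system:
# find_lib_dirs unix /usr/lib /usr/local/lib
```

Running ocamlc -where also prints the directory of the installed OCaml standard library, which is a good first place to look.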
You can test ocamorph with the binary distribution available at ftp://ftp.mokk.bme.hu/Tool/Hunmorph/Resources/Morphdb.hu/morphdb-hu-20060525.tgz. If you untar the file in ~/source/ you should see:
$ ls ~/source/morphdb.hu/
AUTHORS  CVS  doc  LICENCE  morphdb_hu.aff  morphdb_hu.dic  README
You can then test it with:
$ echo "programot" | ocamorph --aff ~/source/morphdb.hu/morphdb_hu.aff --dic ~/source/morphdb.hu/morphdb_hu.dic
> programot
program/NOUN<CAS<ACC>>
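To give an idea of what changing the output format for Apertium might involve, here is a rough sketch of a filter that rewrites this output into the Apertium stream form ^surface/lemma<tag>...$. It assumes that each input word appears on a line starting with "> " followed by one analysis per line, as in the example above. The function name ocamorph_to_apertium is hypothetical, and the tag handling is deliberately naive: it flattens the nested tags and lowercases them rather than mapping them to real Apertium symbols.

```shell
# Hypothetical filter: ocamorph-style output on stdin, Apertium-style units on stdout.
# Assumed input layout: "> surface" line, then zero or more "lemma/TAGS" lines.
ocamorph_to_apertium() {
  awk '
    # Print the unit collected so far, if any, then reset.
    function emit() {
      if (surface != "") print "^" surface analyses "$"
      surface = ""; analyses = ""
    }
    # A new input word starts a new unit.
    /^> / { emit(); surface = substr($0, 3); next }
    # Any other non-empty line is treated as one analysis of the current word.
    NF && surface != "" {
      slash = index($0, "/")
      if (slash == 0) { lemma = $0; tags = "" }
      else { lemma = substr($0, 1, slash - 1); tags = substr($0, slash + 1) }
      gsub(/[<>]+/, " ", tags)      # flatten nested tags: NOUN<CAS<ACC>> becomes NOUN CAS ACC
      n = split(tags, t, " ")
      out = lemma
      for (i = 1; i <= n; i++) out = out "<" tolower(t[i]) ">"
      analyses = analyses "/" out
    }
    END { emit() }
  '
}
```

With the example above as input, this prints ^programot/program<noun><cas><acc>$. A real integration would need a proper table from hunmorph tags to Apertium tags, and would have to decide what to do with the nesting that is thrown away here.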
- Compiling the lexicon
First check out morphdb.hu and hunlex:
cvs -d :pserver:anonymous:firstname.lastname@example.org:/local/cvs co lexicons/morphdb.hu
cvs -d :pserver:anonymous:email@example.com:/local/cvs co hunlex
Enter the hunlex/ directory and run make. If there are errors, ignore them, provided that a hunlex executable is created.
Next, enter the lexicons/morphdb.hu directory, edit the Makefile and change the paths, including HUNLEX, to point to the directory where you just installed it. Uncomment them if necessary.
Now run make. Note that compiling the lexicon can take a long time and use a lot of CPU, so consider running it on a machine other than your desktop. The build produces the resultant .dic files.
Timing for a 10,000-line test file, using an analyser that supports 4,000,000 word forms:
$ time cat /tmp/test | ocamorph --aff ~/source/morphdb.hu/morphdb_hu.aff --dic ~/source/morphdb.hu/morphdb_hu.dic > /dev/null

real 0m47.224s
user 0m41.859s
sys 0m0.620s
To speed up analysis, compile the lexicon into a binary file using:
$ echo "programot" | ocamorph --aff ~/source/morphdb.hu/morphdb_hu.aff --dic ~/source/morphdb.hu/morphdb_hu.dic --bin hu.morph.bin
You seem to be required to attempt to analyse something in order to compile. Then re-test:
$ time cat /tmp/test | ocamorph --bin hu.morph.bin > /dev/null

real 0m15.023s
user 0m14.625s
sys 0m0.344s
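To put the two timings side by side, the arithmetic can be sketched as below. It uses the "real" times quoted on this page and the 10,000-line size of the test file. timing_summary is a hypothetical helper, and these are single runs rather than a careful benchmark.

```shell
# Figures taken from the timings on this page (single runs, "real" time in seconds).
lines=10000
text_secs=47.224    # analysing with the plain .aff/.dic lexicon
bin_secs=15.023     # analysing with the compiled hu.morph.bin

# Hypothetical helper: print the throughput of each run and the ratio between them.
timing_summary() {
  awk -v n="$1" -v a="$2" -v b="$3" 'BEGIN {
    printf "text lexicon:   %.0f lines/s\n", n / a
    printf "binary lexicon: %.0f lines/s\n", n / b
    printf "speed-up:       %.2fx\n", a / b
  }'
}

timing_summary "$lines" "$text_secs" "$bin_secs"
```

This works out to roughly 212 lines per second with the text lexicon against 666 with the compiled one, a speed-up of about 3.14x.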
The final size of the compiled binary is 22 MB.
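Since the binary only needs to be compiled once, the two steps can be wrapped in a small helper that compiles on first use and reuses the result afterwards. This is a sketch under the assumption that ocamorph behaves as described on this page, that is, --aff/--dic together with --bin writes the binary and --bin alone reads it. The name analyse_cached is hypothetical.

```shell
# Hypothetical wrapper: analyse stdin, compiling the binary lexicon first if it is missing.
# Usage: analyse_cached <aff file> <dic file> <bin file> < words.txt
analyse_cached() {
  aff="$1"
  dic="$2"
  bin="$3"
  if [ ! -f "$bin" ]; then
    # Compilation seems to require something to analyse, so feed it one word
    # and throw the analysis away.
    echo "programot" | ocamorph --aff "$aff" --dic "$dic" --bin "$bin" > /dev/null || return 1
  fi
  # From here on only the compiled binary is needed.
  ocamorph --bin "$bin"
}

# Example call (paths as used elsewhere on this page):
# analyse_cached ~/source/morphdb.hu/morphdb_hu.aff ~/source/morphdb.hu/morphdb_hu.dic hu.morph.bin < /tmp/test
```

The first call pays the compilation cost; later calls skip straight to the faster binary lexicon. Delete the .bin file to force a rebuild after the lexicon changes.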
 Further reading
- Trón, V., Németh, L., Halácsy, P., Kornai, A., Gyepesi, G., and Varga, D. (2005) "Hunmorph: open source word analysis". Proceedings of the ACL 2005 Workshop on Software. pp. 77--85