Latest revision as of 08:54, 23 September 2022
foma is a finite-state toolkit that implements Xerox lexc and xfst. It can be used for building morphologies of natural languages.
Installation
Note: foma requires the libreadline and zlib development packages to be installed. On Debian or Ubuntu, use apt-get install libreadline-dev zlib1g-dev.
wget http://dingo.sbs.arizona.edu/~mhulden/foma-0.9.15alpha.tar.gz
tar -xzvf foma-0.9.15alpha.tar.gz
cd foma
make
sudo make install
Or, from SVN:
svn checkout http://foma.googlecode.com/svn/trunk/foma/ foma
cd foma
# Run this first if you are not installing with sudo:
# sed -i.tmp "s%prefix = /usr/local%prefix = $PREFIX%" Makefile
make
sudo make install
Installation troubleshooting
If you get an error about -fPIC (happens on Arch Linux), run:
make clean
make CFLAGS=-fPIC
sudo make install
If you get an error like
/usr/bin/ld: cannot find -ltermcap
collect2: ld returned 1 exit status
make: *** [libfoma] Error 1
when running make, open the Makefile and change -ltermcap to -lncurses (happens on Arch Linux and openSUSE).
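The same edit can be scripted. The following is a minimal sketch, assuming the flag appears literally as -ltermcap in the Makefile. It runs on a throwaway file whose contents are made up for illustration, so it can be tried before touching the real Makefile.

```shell
#!/bin/sh
# Sketch: swap -ltermcap for -lncurses in a Makefile, keeping a backup.
# The demo Makefile below is a made-up stand-in, not foma's real Makefile.
workdir=$(mktemp -d)
printf 'LDFLAGS = -lreadline -lz -ltermcap\n' > "$workdir/Makefile"

# -i.bak edits the file in place and leaves the original as Makefile.bak.
sed -i.bak 's/-ltermcap/-lncurses/g' "$workdir/Makefile"

patched=$(cat "$workdir/Makefile")
backup=$(cat "$workdir/Makefile.bak")
echo "$patched"
rm -rf "$workdir"
```

For the real build, the sed line would be run inside the foma source directory against its Makefile.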
If make stops with the error Makefile:12: *** missing separator. Stop. then edit the Makefile and add \ to the end of lines 11 to 13.
If you get an error like this (seen on Ubuntu 11.10):
/usr/bin/ld: int_stack.o: relocation R_X86_64_32S against `.bss' can not be used when making a shared object; recompile with -fPIC
int_stack.o: could not read symbols: Bad value
collect2: ld returned 1 exit status
make: *** [libfoma] Error 1
edit the Makefile and change a line that looks like this
CFLAGS = -O3 -Wall -D_GNU_SOURCE -std=c99 -fvisibility=hidden
to this
CFLAGS = -O3 -Wall -D_GNU_SOURCE -std=c99 -fvisibility=hidden -fPIC
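This Makefile change can also be applied from the shell. The sketch below assumes the line starts with "CFLAGS = " exactly as quoted above. add_fpic is a hypothetical helper name, and the demo file contains only that one line.

```shell
#!/bin/sh
# Sketch: append -fPIC to the CFLAGS line of a Makefile unless it is already there.
# The demo Makefile is a stand-in containing only the CFLAGS line quoted above.
workdir=$(mktemp -d)
printf 'CFLAGS = -O3 -Wall -D_GNU_SOURCE -std=c99 -fvisibility=hidden\n' > "$workdir/Makefile"

add_fpic() {
    # $1: path to the Makefile to patch
    if grep -q '^CFLAGS = .*-fPIC' "$1"; then
        return 0    # already patched, nothing to do
    fi
    # & stands for the whole matched line; the original is kept as Makefile.bak.
    sed -i.bak 's/^CFLAGS = .*$/& -fPIC/' "$1"
}

add_fpic "$workdir/Makefile"
add_fpic "$workdir/Makefile"    # the second call is a no-op
cflags=$(grep '^CFLAGS' "$workdir/Makefile")
echo "$cflags"
rm -rf "$workdir"
```

The grep guard makes the helper safe to run more than once, which is useful when the source tree is rebuilt repeatedly.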
Example usage
First check out the Greenlandic (kal) morphology from Giellatekno SVN:
$ svn co https://victorio.uit.no/langtech/trunk/st/kal
Move to the src/ directory and combine all the lexc source files:
$ cat kal-lex.txt \
    abbr-kal-lex.txt acro-kal-lex.txt \
    noun-kal-lex.txt verb-kal-lex.txt \
    ateq-kal-lex.txt ateq-kal-morph.txt \
    punct-kal-lex.txt prt-kal-lex.txt num-kal-lex.txt > kal-lex-all.lexc
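If one of the source files is missing, cat prints an error but still writes the remaining files, leaving an incomplete lexicon. A small guard can check the inputs first. This is only a sketch: combine_lexc is a hypothetical helper, and the two demo files and their contents are made up.

```shell
#!/bin/sh
# Sketch: concatenate lexc sources only if every listed file is readable.
# combine_lexc is a hypothetical helper; usage: combine_lexc OUTPUT FILE...
combine_lexc() {
    out=$1
    shift
    for f in "$@"; do
        if [ ! -r "$f" ]; then
            echo "missing lexc source: $f" >&2
            return 1
        fi
    done
    cat "$@" > "$out"
}

# Demo in a scratch directory with two stand-in files (contents are made up).
workdir=$(mktemp -d)
printf 'LEXICON Root\n' > "$workdir/kal-lex.txt"
printf 'LEXICON Noun\n' > "$workdir/noun-kal-lex.txt"

if combine_lexc "$workdir/all.lexc" "$workdir/kal-lex.txt" "$workdir/noun-kal-lex.txt"; then
    ok_status=0
else
    ok_status=1
fi
combined=$(cat "$workdir/all.lexc")

# A missing input makes the helper fail before anything is written.
if combine_lexc "$workdir/bad.lexc" "$workdir/kal-lex.txt" "$workdir/no-such-file.txt" 2>/dev/null; then
    missing_status=0
else
    missing_status=1
fi
rm -rf "$workdir"
```

In the src/ directory the helper would be called with kal-lex-all.lexc as the output and the file list from the cat command above.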
Next, remove the comments from the xfst rewrite rule file:
$ cat xfst-kal.txt | sed 's/\s\!.*$/ /g' | grep -v '^!' | sed 's/$/ /g' | grep -v 'echo' > xfst-kal.tmp
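To see what each stage of that pipeline does, it can be run on a tiny sample. The sketch below uses a made-up four-line rule file rather than the real xfst-kal.txt, and it assumes GNU sed, because \s is a GNU extension.

```shell
#!/bin/sh
# Sketch: run the comment-stripping pipeline on a tiny made-up rule file.
# Assumes GNU sed, since \s is a GNU extension.
workdir=$(mktemp -d)
cat > "$workdir/sample.txt" <<'EOF'
! a full-line comment
define Vow [a | e | i] ; ! trailing comment
echo compiling rules
define Cns [k | t] ;
EOF

# 1. cut trailing comments, 2. drop full-line comments,
# 3. pad each line with a space, 4. drop echo statements.
cleaned=$(sed 's/\s\!.*$/ /g' "$workdir/sample.txt" \
    | grep -v '^!' | sed 's/$/ /g' | grep -v 'echo')
printf '%s\n' "$cleaned"
rm -rf "$workdir"
```

Only the two define lines survive. Note that the last stage drops every line containing the string echo, not just echo statements.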
To compile the xfst code, run foma and load the rewrite rules:
$ foma
foma[0]: source xfst-kal.tmp
Opening file 'xfst-kal.tmp'.
defined Vow: 348 bytes. 2 states, 6 arcs, 6 paths.
defined Cns: 741 bytes. 2 states, 19 arcs, 19 paths.
...
6.1 MB. 12474 states, 402541 arcs, Cyclic.
foma[1]:
Note the [1] in the prompt: it is the number of transducers on the stack. If you don't get this, something has gone wrong.
Next, save the compiled transducer and quit:
foma[1]: save stack xfst-kal.bin
Writing to file xfst-kal.bin.
foma[1]: quit
Now compile the lexc file, save the resulting transducer, and quit:
$ foma
foma[0]: read lexc kal-lex-all.lexc
Root...8, Z1Zmorf...59, Z1SZmorf...56, Z1PZmorf...59, Z1+ssZmorf...59, ...
Building lexicon...Determinizing...Minimizing...Done!
85.5 MB. 154826 states, 5599566 arcs, Cyclic.
foma[1]: save stack kal-lex.save
Writing to file kal-lex.save.
foma[1]: quit
The final step is to compose the two transducers (the lexicon and the rewrite rules):
$ foma
foma[0]: regex [[@"kal-lex.save"] .o. [[@"kal-lex.save"].l .o. [@"xfst-kal.bin"]] ] ;
This final step takes some time, up to 2 or 3 minutes, and uses a lot of processing power and RAM. The final result will be:
76.4 MB. 160041 states, 5002206 arcs, Cyclic.
foma[1]:
Then save the final transducer, and quit:
foma[1]: save stack kal.morph.bin
Writing to file kal.morph.bin.
foma[1]: quit
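The interactive steps can also be collected into one command file and fed to foma on standard input. What follows is a sketch under assumptions, not a tested build script. The clear stack commands are an addition that stands in for quitting between steps, so that each save stack writes only one transducer. cmdfile is a hypothetical name, and foma is only run if it and the input files are present.

```shell
#!/bin/sh
# Sketch: write the interactive build steps to a command file and, if foma and
# the input files are available, feed that file to foma on standard input.
cmdfile=$(mktemp)
cat > "$cmdfile" <<'EOF'
source xfst-kal.tmp
save stack xfst-kal.bin
clear stack
read lexc kal-lex-all.lexc
save stack kal-lex.save
clear stack
regex [[@"kal-lex.save"] .o. [[@"kal-lex.save"].l .o. [@"xfst-kal.bin"]] ] ;
save stack kal.morph.bin
EOF

if command -v foma >/dev/null 2>&1 && [ -f xfst-kal.tmp ] && [ -f kal-lex-all.lexc ]; then
    foma < "$cmdfile"
else
    echo "foma or the input files are missing; the commands were only written out" >&2
fi
commands=$(cat "$cmdfile")
rm -f "$cmdfile"
```

Keeping the commands in a file makes the build repeatable after the lexc or xfst sources change.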
You can now use the transducer for analysis and generation, for example:
$ foma
foma[0]: load kal.morph.bin
76.4 MB. 160041 states, 5002206 arcs, Cyclic.
foma[1]: apply up nittartagaq
nittar+TAR+vv+TAQ+N+Abs+Sg
foma[1]: apply up kalaallisut
kalaaleq+N+Aeq+Pl
kalaaleq+N+Aeq+Sg
foma[1]: apply down kalaaleq+N+Aeq+Sg
kalaallitut
kalaallisut
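For more than a handful of words, typing apply up at the prompt gets tedious. One option is to generate the commands from a word list and pipe them into foma. The sketch below only builds the command text. make_apply_script is a hypothetical helper, and the two words are the ones from the session above.

```shell
#!/bin/sh
# Sketch: turn a word list on stdin into a foma command script.
# make_apply_script is a hypothetical helper; $1 is the saved transducer,
# $2 is the direction (up = analysis, down = generation).
make_apply_script() {
    printf 'load %s\n' "$1"
    while IFS= read -r word; do
        # Skip blank lines so no empty apply command is emitted.
        if [ -n "$word" ]; then
            printf 'apply %s %s\n' "$2" "$word"
        fi
    done
    return 0
}

script=$(printf 'nittartagaq\nkalaallisut\n' | make_apply_script kal.morph.bin up)
printf '%s\n' "$script"

# With foma installed and kal.morph.bin built, the output can be piped in:
#   printf 'nittartagaq\nkalaallisut\n' | make_apply_script kal.morph.bin up | foma
```

Passing down instead of up as the second argument produces apply down commands for generation.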
Visualising an Apertium transducer
$ lt-print no-en.autobil.bin > /tmp/no-en.txt
$ foma
foma[0]: read att /tmp/no-en.txt
foma[1]: view
Make sure you have installed a .dot renderer to convert the file to PNG. On Ubuntu, install Graphviz with:
$ sudo apt-get install graphviz
You could also put this script in a file lt-view and then run lt-view foo.automorf.bin >foo.png:
#!/bin/sh

set -e -u

# Check the prerequisites and the arguments before doing any work.
if ! command -V dot >/dev/null; then
    echo "Please install graphviz (e.g. apt install graphviz)" >&2
    exit 1
elif ! command -V foma >/dev/null; then
    echo "Please install foma (e.g. apt install foma)" >&2
    exit 1
elif [ $# -ne 1 ]; then
    echo "Expecting an lttoolbox binary as arg 1, no other args" >&2
    exit 1
elif [ -t 1 ]; then
    # Refuse to dump binary PNG data onto a terminal.
    echo "This will write a png file – you should redirect, e.g. $0 $* > fst.png" >&2
    exit 1
fi

# Work in a temporary directory that is removed when the script exits.
tmpd=$(mktemp -dt lt-view.XXXXXXXXXXX)
trap 'rm -rf "${tmpd}"' EXIT

# Dump the transducer in AT&T format, let foma convert it to dot,
# then render the PNG on standard output.
lt-print "$1" > "${tmpd}"/att
printf 'read att %s\nprint dot >%s\n' "${tmpd}"/att "${tmpd}"/dot | foma >/dev/null
dot -Tpng "${tmpd}"/dot
External links
- http://foma.sourceforge.net/
- Giellatekno SVN (https://victorio.uit.no/langtech/trunk/st): here you can find some example morphologies in foma format.