Difference between revisions of "Maltese and Hebrew/Final report"

 
(28 intermediate revisions by 3 users not shown)
Line 8: Line 8:
Writing the Maltese morphological analyser was the hardest task in the project and required most of the time. That being said, I am very pleased with the results we got.


We used the very little grammar resources we had<ref>J. Aquilina (1994), Teach Yourself Maltese. [http://books.google.com/books?id=iCdjAAAAMAAJ]</ref><ref>A. Borg (1997), Maltese. [http://books.google.com/books?id=rsA5jUU_3g4C]</ref> for adding closed-category terms and learning about morphological rules in general.
We used the very little grammar resources we had<ref>J. Aquilina (1994), Teach Yourself Maltese. [http://books.google.com/books?id=iCdjAAAAMAAJ]</ref><ref>A. Borg (1997), Maltese (Comparative Grammar). [http://books.google.com/books?id=rsA5jUU_3g4C]</ref> for adding closed-category terms and learning about morphological rules in general.


Verbs were a challenge; lttoolbox pardefs are not capable of fully handling Maltese verbs, so here we created Python scripts that take a CSV file of verb "stems"[http://en.wikipedia.org/wiki/Semitic_root] and some paradigm information, and output full-form lttoolbox entries.
We then used Malteses frequency lists generated from the various corpora, and added them slowly using all kinds of translation tools, dictionaries and/or learning/guestimating by context and usage. This was a headache but got very good results; within about 2 weeks (and during my exams period) we got to a ~80% coverage of the Maltese corpora.

We then used Maltese frequency lists generated from the various corpora, and categorized terms slowly using (educated) guesses by context and usage.

This was a headache but got very good results; within about 2 weeks (and during my exams period) we got to a ~80% coverage of the Maltese corpora.


Documentation about Maltese morphology & grammar is very sparse and unsatisfying. This presented a huge challenge throughout the whole time.
Line 22: Line 26:
===Hebrew===


In comparison, writing he.dix and handling Hebrew generation was fairly easy. Other than my own Hebrew knowledge, this was mostly due to research I've done before GSoC started (for [[User:N0nick/Application|my application]]).
In comparison, writing he.dix and handling Hebrew generation was fairly easy. Other than my own Hebrew knowledge, this was mostly due to the research I've done before GSoC started (for [[User:N0nick/Application|my application]]).


We have tweaked some code from the [http://hspell.ivrix.org.il/ hspell] Hebrew spellchecker project, to get most of the open-category terms.
We have tweaked some code from [http://hspell.ivrix.org.il/ hspell], an open-source Hebrew spellchecker project, to get most of the open-category terms.
This way we easily got good enough coverage of nouns, verbs, adjectives, etc.


Line 30: Line 34:


===Bidix===

The mt→he bidix work was a very 'automatic' task. A lot of the terms we previously added to mt.dix came with gloss so we were able to use it for the translations.
For the rest, we used all kinds of translation tools and dictionaries, or learned/guestimated the translation by the context in the corpus.
This took a long time and wasn't fun. But we were able to get a good coverage percentage (in most categories, we got to 99-100%).


===Transfer rules===

Unfortunately, due to time limitations I did not get to do a lot of these.
We wrote a few transfer rules when we recognized obvious transfer errors in some tests, but we didn't have time to properly test the dictionary and go over example sentences.

The things we did fix were very easy to do, probably because of similarities between Maltese and Hebrew grammars. So that's promising.


==Statistics==


; Dictionaries

* <code>apertium.mt-he.mt.dix</code>: <b>3,789</b> lemmata; <b>610,256</b> surface forms.
* <code>apertium.mt-he.he.dix</code>: <b>20,902</b> lemmata; <b>547,272</b> surface forms.


; Coverage


* Maltese Wikipedia ( , std. dev.: )
* Maltese Wikipedia (<b>78.25%</b>, std. dev.: <b>1.23693</b>)
* Maltese news sites ( , std. dev.: )
* Maltese news sites (<b>79.375%</b> , std. dev.: <b>1.39134</b>)
* Maltese Scannel corpus ( , std. dev.: )
* Maltese Scannel corpus (<b>79%</b>, std. dev.: <b>1.9164</b>)


; Rules


* <code>apertium.mt-he.mt-he.t1x</code>: <b>16</b> rules.
; Error rate

; Testvoc Summary

<pre>
on. 24. aug. 11:08:12 +0200 2011
=======================================================
POS Total Clean With @ Clean % With # Clean %
prn 520834 518664 0 100 2170 99.58
vblex 5572 5539 0 100 33 99.40
n 3770 2533 463 87.71 774 67.18
adj 2343 403 38 98.37 1902 17.20
np 623 540 83 86.67 0 86.67
adv 156 155 0 100 1 99.35
pr 123 123 0 100 0 100
vaux 92 92 0 100 0 100
num 83 83 0 100 0 100
det 32 32 0 100 0 100
cnjcoo 13 13 0 100 0 100
cnjsub 10 10 0 100 0 100
cnjadv 4 4 0 100 0 100
ij 3 3 0 100 0 100
abbr 3 3 0 100 0 100
rel 1 1 0 100 0 100
=======================================================
</pre>


==Future work==
#Finish working on bidix and monodix files, completing testvoc.
#Retrain the tagger for better handling of Maltese grammar and transfer.
#Add lots more transfer rules.
#Merge the two verb scripts into one that represents a better knowledge of Maltese verbs.
#Fix possessive suffixes on Maltese nouns (only partially done).
#Find out missing gender/number for nouns and adjectives that are marked GD or ND.


==Thanks==

Mostly I'd like to thank my mentors [[User:Unhammer|Kevin Unhammer]] and [[User:Francis_Tyers|Francis Tyers]] who have been amazingly helpful and always available. No words can be enough to express my gratitude and how happy I am to have known them and worked with them.

Additionally, as I said I have contacted several people along the way and everyone has been kind and helpful. Many thanks is due to everyone involved, among them:

* My CL professor [http://tau.ac.il/~rkatzir/ Dr. Roni Katzir] has provided guidance and contacts regarding Hebrew resources.
* [http://www.math.tau.ac.il/~nachumd/ Prof. Nachum Dershowitz] of the TAU Computer Science faculty has also provided great pointers and contacts.
* The open-source [http://hspell.ivrix.org.il/ hspell] project has been very useful for our work on the Hebrew generator.
* The [http://www.mila.cs.technion.ac.il/ MILA] project for Hebrew CL resources provided access to Hebrew analysers and corpora.
* [http://dingo.sbs.arizona.edu/~ussishkin/ Prof. Adam Ussishkin] of Arizona University consulted us on many subjects regarding Maltese as well as his work on verb conjugation.
* [http://johnjcamilleri.com/ Mr. John J. Camilleri] also provided consultation on various subjects in Maltese, as well as many pointers to helpful resources.
* [http://www.zukunftskolleg.uni-konstanz.de/personen/personen-details/spagnol-michael/6338/2255/ Mr. Michael Spagnol] provided us with his work on Maltese nouns.


==See Also==
* [[User:N0nick/Application|Maltese and Hebrew GSoC Application]].
* [[Maltese and Hebrew]]: Listings of work and research updated as we go.
* [[User:N0nick/GSoC_Journal| My GSoC work journal]] describing my weekly progress.
* [http://xixona.dlsi.ua.es/~fran/maltese/index.php Maltese morphological analysis]: A live online demo of the Maltese analyser we developed.


==Footnotes==

Latest revision as of 11:13, 26 August 2011

Description

Maltese

Writing the Maltese morphological analyser was the hardest task in the project and required most of the time. That being said, I am very pleased with the results we got.

We used the very few grammar resources we had[1][2] for adding closed-category terms and learning about morphological rules in general.

Verbs were a challenge; lttoolbox pardefs are not capable of fully handling Maltese verbs, so here we created Python scripts that take a CSV file of verb "stems" (see http://en.wikipedia.org/wiki/Semitic_root) and some paradigm information, and output full-form lttoolbox entries.
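To give an idea of what such a script looks like, here is a minimal sketch. It is not one of the two project scripts: the CSV columns (lemma, stem_c, stem_v), the suffix table and the tag names are assumptions made for illustration, and the table only covers the perfective of one regular verb.

```python
import csv
import io
from xml.sax.saxutils import escape

# Hypothetical perfective paradigm: (CSV column holding the stem, suffix, tags).
# stem_c is used before consonant-initial suffixes, stem_v before vowel-initial ones.
PERFECTIVE = [
    ("stem_c", "t", ["p1", "sg"]),
    ("stem_c", "t", ["p2", "sg"]),
    ("lemma", "", ["p3", "m", "sg"]),
    ("stem_v", "et", ["p3", "f", "sg"]),
    ("stem_c", "na", ["p1", "pl"]),
    ("stem_c", "tu", ["p2", "pl"]),
    ("stem_v", "u", ["p3", "pl"]),
]


def entries(row):
    """Yield one full-form lttoolbox <e> entry per paradigm cell."""
    lemma = escape(row["lemma"])
    for column, suffix, tags in PERFECTIVE:
        surface = escape(row[column] + suffix)
        symbols = "".join('<s n="%s"/>' % t for t in ["vblex", "past"] + tags)
        yield '<e lm="{lm}"><p><l>{l}</l><r>{lm}{s}</r></p></e>'.format(
            lm=lemma, l=surface, s=symbols
        )


def generate(csv_text):
    """Read CSV text with a header row and return all generated entries."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [entry for row in reader for entry in entries(row)]


# Illustrative input: one verb with its two stems.
SAMPLE = "lemma,stem_c,stem_v\nkiteb,ktib,kitb\n"

if __name__ == "__main__":
    print("\n".join(generate(SAMPLE)))
```

Writing out full forms like this avoids pardefs entirely, at the price of a much larger dictionary file.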

We then used Maltese frequency lists generated from the various corpora, and categorized terms slowly using (educated) guesses by context and usage.

This was a headache but got very good results; within about 2 weeks (and during my exams period) we got to a ~80% coverage of the Maltese corpora.
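For reference, naive coverage can be computed from the analyser output roughly as follows. This is a sketch under assumptions: it expects text in the Apertium stream format as printed by lt-proc, where an unknown word appears as ^word/*word$, it ignores escaped characters, and the sample analyses and tags are made up for illustration.

```python
import re

# One lexical unit in the stream format: ^surface/analysis1/analysis2...$
TOKEN = re.compile(r"\^([^/$]*)/([^$]*)\$")


def coverage(analysed):
    """Return the fraction of tokens that received at least one analysis."""
    known = total = 0
    for _surface, analyses in TOKEN.findall(analysed):
        total += 1
        # lt-proc marks unknown words by prefixing the analysis with '*'.
        if not analyses.startswith("*"):
            known += 1
    return known / total if total else 0.0


# Illustrative analyser output: three known tokens, one unknown.
SAMPLE = (
    "^kelb/kelb<n><m><sg>$ ^xyz/*xyz$ "
    "^kiteb/kiteb<vblex><past><p3><m><sg>$ ^u/u<cnjcoo>$"
)

if __name__ == "__main__":
    print("coverage: {:.2%}".format(coverage(SAMPLE)))
```

This only says whether a token got some analysis, not whether the analysis is correct.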

Documentation about Maltese morphology & grammar is very sparse and unsatisfying. This presented a huge challenge throughout the project. Luckily, my ninja mentors were able to figure out ways to learn what's needed. Additionally, we contacted people who previously researched and worked on Maltese and they all were very nice and glad to help out; we were able to use their work and knowledge at a few critical points in the project.

The verbs

Analysing Maltese verbs has proved to be the biggest issue and I don't feel we got it right yet. We ended up having two scripts that generate verb forms from given stem lists: one I initially wrote using examples from Teach Yourself Maltese and the web, which has a lot of problems and errors in it, and a better one written by Fran, who did much more careful and thorough work. One of the most important things that remains to be done is merging these into one script that's written intelligently using the information laid out in the new grammar book we found.

Hebrew

In comparison, writing he.dix and handling Hebrew generation was fairly easy. Other than my own Hebrew knowledge, this was mostly due to the research I've done before GSoC started (for my application).

We have tweaked some code from hspell, an open-source Hebrew spellchecker project, to get most of the open-category terms. This way we easily got good enough coverage of nouns, verbs, adjectives, etc.

For closed-category terms, I added a lot of them at the beginning of the project, and then fixed what was needed as we went along with the bidix.

Bidix

The mt→he bidix work was a very 'automatic' task. A lot of the terms we previously added to mt.dix came with glosses, so we were able to use them for the translations. For the rest, we used all kinds of translation tools and dictionaries, or learned/guesstimated the translation from the context in the corpus. This took a long time and wasn't fun. But we were able to get a good coverage percentage (in most categories, we got to 99-100%).

Transfer rules

Unfortunately, due to time limitations I did not get to do a lot of these. We wrote a few transfer rules when we recognized obvious transfer errors in some tests, but we didn't have time to properly test the dictionary and go over example sentences.

The things we did fix were very easy to do, probably because of similarities between Maltese and Hebrew grammars. So that's promising.

Statistics

Dictionaries
  • apertium.mt-he.mt.dix: 3,789 lemmata; 610,256 surface forms.
  • apertium.mt-he.he.dix: 20,902 lemmata; 547,272 surface forms.
Coverage
  • Maltese Wikipedia (78.25%, std. dev.: 1.23693)
  • Maltese news sites (79.375%, std. dev.: 1.39134)
  • Maltese Scannell corpus (79%, std. dev.: 1.9164)
Rules
  • apertium.mt-he.mt-he.t1x: 16 rules.
Testvoc Summary
Wed 24 Aug 11:08:12 +0200 2011
=======================================================
POS	Total	Clean	With @	Clean %	With #	Clean %
prn    	520834	518664	0	100	2170	99.58
vblex  	5572	5539	0	100	33	99.40
n      	3770	2533	463	87.71	774	67.18
adj    	2343	403	38	98.37	1902	17.20
np     	623	540	83	86.67	0	86.67
adv    	156	155	0	100	1	99.35
pr     	123	123	0	100	0	100
vaux   	92	92	0	100	0	100
num    	83	83	0	100	0	100
det    	32	32	0	100	0	100
cnjcoo 	13	13	0	100	0	100
cnjsub 	10	10	0	100	0	100
cnjadv 	4	4	0	100	0	100
ij     	3	3	0	100	0	100
abbr   	3	3	0	100	0	100
rel    	1	1	0	100	0	100
=======================================================
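The two "Clean %" columns are not labelled differently, so here is how the figures appear to be derived. This is inferred from the numbers in the table, not taken from the testvoc script: the first percentage seems to be the share of forms without @, the second the share of fully clean forms (neither @ nor #), both truncated to two decimals. The function names are hypothetical.

```python
def pct(part, total):
    """Percentage of part in total, truncated (not rounded) to two decimals."""
    return (part * 10000 // total) / 100


def clean_percentages(total, with_at, with_hash):
    """Return (share without @, share with neither @ nor #) as percentages."""
    without_at = total - with_at
    clean = total - with_at - with_hash
    return pct(without_at, total), pct(clean, total)


if __name__ == "__main__":
    # Rows copied from the table: POS, Total, With @, With #.
    for pos, total, with_at, with_hash in [
        ("n", 3770, 463, 774),
        ("adj", 2343, 38, 1902),
        ("np", 623, 83, 0),
    ]:
        print(pos, *clean_percentages(total, with_at, with_hash))
```

Under this reading, the low adj figure comes almost entirely from forms marked with #, not from forms marked with @.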

Future work

  1. Finish working on bidix and monodix files, completing testvoc.
  2. Retrain the tagger for better handling of Maltese grammar and transfer.
  3. Add lots more transfer rules.
  4. Merge the two verb scripts into one that represents a better knowledge of Maltese verbs.
  5. Fix possessive suffixes on Maltese nouns (only partially done).
  6. Find out missing gender/number for nouns and adjectives that are marked GD or ND.

Thanks

Mostly I'd like to thank my mentors Kevin Unhammer and Francis Tyers who have been amazingly helpful and always available. No words can be enough to express my gratitude and how happy I am to have known them and worked with them.

Additionally, as I said, I contacted several people along the way and everyone has been kind and helpful. Many thanks are due to everyone involved, among them:

  • My CL professor Dr. Roni Katzir has provided guidance and contacts regarding Hebrew resources.
  • Prof. Nachum Dershowitz of the TAU Computer Science faculty has also provided great pointers and contacts.
  • The open-source hspell project has been very useful for our work on the Hebrew generator.
  • The MILA project for Hebrew CL resources provided access to Hebrew analysers and corpora.
  • Prof. Adam Ussishkin of the University of Arizona advised us on many subjects regarding Maltese and shared his work on verb conjugation.
  • Mr. John J. Camilleri also provided consultation on various subjects in Maltese, as well as many pointers to helpful resources.
  • Mr. Michael Spagnol provided us with his work on Maltese nouns.

See Also

Footnotes

  1. J. Aquilina (1994), Teach Yourself Maltese. http://books.google.com/books?id=iCdjAAAAMAAJ
  2. A. Borg (1997), Maltese (Comparative Grammar). http://books.google.com/books?id=rsA5jUU_3g4C