Difference between revisions of "Maltese and Hebrew/Final report"

From Apertium

Revision as of 09:28, 26 August 2011

Description

Maltese

Writing the Maltese morphological analyser was the hardest task in the project and took most of the time. That said, I am very pleased with the results we got.

We used the few grammar resources we had[1][2] to add closed-category terms and to learn about the morphological rules in general.

We then used Maltese frequency lists generated from the various corpora, and slowly categorized terms using (educated) guesses based on context and usage.

This was a headache but gave very good results; within about 2 weeks (and during my exam period) we reached ~80% coverage of the Maltese corpora.
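The frequency-list workflow above can be sketched roughly as follows. This is a minimal illustration, not the scripts used in the project: it assumes coverage means the share of corpus tokens that the analyser recognises, and the function names, the toy tokens and the `known_forms` set are all hypothetical stand-ins for a real corpus and the real analyser output.

```python
from collections import Counter


def frequency_list(tokens):
    """Return (form, count) pairs, most frequent first."""
    return Counter(tokens).most_common()


def coverage(freq_list, known_forms):
    """Percentage of corpus tokens whose surface form is known.

    Assumption: a token counts as covered if its form is in known_forms,
    which stands in for 'the analyser returns at least one analysis'.
    """
    total = sum(count for _, count in freq_list)
    if total == 0:
        return 0.0
    covered = sum(count for form, count in freq_list if form in known_forms)
    return 100.0 * covered / total


def unknown_by_frequency(freq_list, known_forms):
    """Unknown forms, most frequent first: the next candidates to categorize."""
    return [(form, count) for form, count in freq_list if form not in known_forms]


# Toy data for illustration only (not taken from the project corpora).
tokens = ["il", "kelb", "u", "il", "qattus", "u", "il", "ktieb"]
known_forms = {"il", "u"}

freq = frequency_list(tokens)
print(coverage(freq, known_forms))
print(unknown_by_frequency(freq, known_forms))
```

Working down the unknown forms in frequency order is what makes this approach pay off quickly: a small number of very frequent forms accounts for a large share of the tokens.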

Documentation on Maltese morphology and grammar is very sparse and unsatisfying, which was a huge challenge throughout the project. Luckily, my ninja mentors were able to figure out ways to learn what was needed. Additionally, we contacted people who had previously researched and worked on Maltese; they were all very nice and glad to help out, and we were able to use their work and knowledge at a few critical points in the project.

The verbs

Analysing Maltese verbs proved to be the biggest issue, and I don't feel we have got it right yet. We ended up with two scripts that generate verb forms from lists of stems: one that I initially wrote using examples from Teach Yourself Maltese and the web, which has a lot of problems and errors, and a better one written by Fran, who did much more careful and thorough work. One of the most important remaining tasks is merging these into a single script, written intelligently using the information laid out in the new grammar book we found.

Hebrew

In comparison, writing he.dix and handling Hebrew generation was fairly easy. Apart from my own knowledge of Hebrew, this was mostly thanks to the research I had done before GSoC started (for my application).

We tweaked some code from hspell, an open-source Hebrew spellchecker project, to get most of the open-category terms. This way we easily got good enough coverage of nouns, verbs, adjectives, etc.

For closed-category terms, I added a lot of them at the beginning of the project, and then fixed what was needed as we went along with the bidix.

Bidix

The mt→he bidix work was a very 'automatic' task. Many of the terms we had previously added to mt.dix came with glosses, so we were able to use those for the translations. For the rest, we used all kinds of translation tools and dictionaries, or learned or guesstimated the translation from the context in the corpus. This took a long time and wasn't fun, but we were able to get good coverage (in most categories, we got to 100%).

Transfer rules

Unfortunately, due to time limitations, I did not get to write many of these. We wrote a few transfer rules when we spotted obvious transfer errors in some tests, but we didn't have time to properly test the dictionary and go over example sentences.

The things we did fix were very easy to do, probably because of similarities in the grammars of Maltese and Hebrew. So that's promising.

Statistics

Dictionaries
Coverage
  • Maltese Wikipedia (78.4%, 76.6%, 79.6%, 78.4%, std. dev.: 1.23693)
  • Maltese news sites (80.3%, 77.4%, 79.4%, 80.4%, std. dev.: 1.39134)
  • Maltese Scannel corpus ( , std. dev.: )
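
The standard deviations listed above match the sample standard deviation (n - 1 in the denominator) of the four percentages on each line. A short sketch that reproduces them, assuming the four figures are separate coverage measurements on each corpus; the variable names and the `summarize` helper are illustrative only:

```python
import statistics

# Coverage figures copied from the list above (percent).
wikipedia_runs = [78.4, 76.6, 79.6, 78.4]
news_runs = [80.3, 77.4, 79.4, 80.4]


def summarize(runs):
    """Return (mean, sample standard deviation) for a list of percentages."""
    return statistics.mean(runs), statistics.stdev(runs)


for name, runs in [("Wikipedia", wikipedia_runs), ("News sites", news_runs)]:
    mean, stdev = summarize(runs)
    print(f"{name}: mean {mean:.3f}%, std. dev. {stdev:.5f}")
```

The same helper can be applied to the third corpus once its figures are available.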
Rules
Error rate

Future work

Thanks

See Also

Footnotes

  1. J. Aquilina (1994), Teach Yourself Maltese. [1]
  2. A. Borg (1997), Maltese (Comparative Grammar). [2]