Difference between revisions of "Arabic"
Youssefsan (talk | contribs) (+ intro) |
Youssefsan (talk | contribs) (+ link) |
Revision as of 17:55, 2 December 2012
Arabic is a Semitic language (http://en.wikipedia.org/wiki/Hamito-Semitic). There is currently only one pair in development:
- staging mt-ar: Maltese-Arabic. http://apertium.svn.sourceforge.net/viewvc/apertium/staging/apertium-mt-ar/
Developing pairs between Arabic and other related languages would be a good idea, for example Tamazight (a Berber language in the same Hamito-Semitic family) or Hebrew (Semitic).
See also
- List of language pairs
Resources
- Sarf - Arabic Morphology System (all in Java...)
- AraMorph - Perl - An Arabic morphological analyzer and part-of-speech tagger written in Perl (originally by Tim Buckwalter, see http://www.qamus.org/morphology.htm)
- AraMorph - Java - An Arabic morphological analyzer and part-of-speech tagger rewritten in Java for Lucene
- Arabic dictionaries, by Jon Dehdari, for the Link-Grammar parser. These require the AraMorph stemming package listed above.
- ElixirFM (an online interface is available) is an implementation of Functional Arabic Morphology written in Haskell and Perl; its lexicon is a "re-processed" version of the Buckwalter analyser's lexicon.
- The Beesley/Karttunen finite-state transducer book, which documents the Xerox compiler, has good documentation on how to build a morphological analyser for Arabic (and for Semitic languages in general). Ken Beesley also built an Arabic FST. There is now also an open-source compiler that reads the Xerox format: the HFST compiler.
- AraComLex (an online interface is available) is an open-source finite-state morphological analyser for Arabic. The resources related to AraComLex include a list of Arabic morphological patterns and a word frequency list drawn from a 1-billion-word corpus.
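To illustrate what "morphological patterns" means for Arabic, here is a minimal Python sketch of root-and-pattern interdigitation. It is only a toy under stated assumptions: the function name `interdigitate`, the use of `C` as a consonant slot, and the simplified Latin transliteration are all invented for this example. It does not reflect how the Xerox tools, HFST, or AraComLex represent patterns internally.

```python
def interdigitate(root, pattern):
    """Fill each 'C' slot of a pattern with the next root consonant.

    Hypothetical helper for illustration only; 'C' as the slot marker
    and the transliteration scheme are assumptions of this sketch.
    """
    if pattern.count("C") != len(root):
        raise ValueError("pattern slots and root consonants do not match")
    consonants = iter(root)
    # Copy vowels and affixes as-is; replace each slot with a consonant.
    return "".join(next(consonants) if ch == "C" else ch for ch in pattern)


# The root k-t-b ("writing") combined with three common patterns.
print(interdigitate("ktb", "CaCaCa"))   # kataba  "he wrote"
print(interdigitate("ktb", "CaaCiC"))   # kaatib  "writer"
print(interdigitate("ktb", "maCCuuC"))  # maktuub "written"
```

A real analyser works in the opposite direction as well (from surface form back to root and pattern) and has to handle weak roots, assimilation, and orthography, which is why finite-state tools are normally used instead of string substitution.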
Corpora
- Meedan-Memory, Arabic-English TMX (sentence-aligned), ~467,000 words on the English side, Open Database Licence
- Quranic Arabic Corpus, 77,430 words of Quranic Arabic, with manually verified contextual POS, inflection, derivation; dependency grammar annotation is planned.
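
A sentence-aligned TMX file such as the Meedan one can be read with the Python standard library alone. The following is a minimal sketch, assuming a standard TMX layout (`tu` units containing `tuv` variants with an `xml:lang` attribute and a `seg` child). The function name `read_tmx_pairs` and the inline sample are made up for illustration; the actual Meedan file may use different language codes or extra markup.

```python
import xml.etree.ElementTree as ET

# xml:lang lives in the XML namespace; ElementTree exposes it under this key.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx_pairs(tmx_text, src_lang="en", tgt_lang="ar"):
    """Return (source, target) segment pairs from a TMX document string."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.findall("tuv"):
            # Older TMX versions use a plain "lang" attribute instead.
            lang = tuv.get(XML_LANG) or tuv.get("lang") or ""
            seg = tuv.find("seg")
            if seg is not None:
                # Normalise codes such as "en-US" to "en".
                key = lang.lower().split("-")[0]
                segs[key] = "".join(seg.itertext()).strip()
        if src_lang in segs and tgt_lang in segs:
            pairs.append((segs[src_lang], segs[tgt_lang]))
    return pairs


# Tiny invented sample, not taken from the Meedan corpus.
SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="en" datatype="plaintext" segtype="sentence"
          adminlang="en" creationtool="example" creationtoolversion="0"
          o-tmf="none"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Hello</seg></tuv>
      <tuv xml:lang="ar"><seg>مرحبا</seg></tuv>
    </tu>
  </body>
</tmx>"""

pairs = read_tmx_pairs(SAMPLE_TMX)
english_words = sum(len(src.split()) for src, _ in pairs)
print(pairs, english_words)
```

For a file on disk, `ET.parse(path).getroot()` could replace `ET.fromstring`; for very large memories, `ET.iterparse` would avoid loading everything at once.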