From Apertium



The Persian language has quite specific grammar requirements. Sentence structure is SOV (subject-object-verb), and the main clause of a sentence precedes a subordinate clause. The interrogative particle āyā, which marks a yes/no question in Persian, must appear at the beginning of the sentence. Unusually for an SOV language, Persian uses prepositions. The one case marker, rā, follows the definite direct object noun phrase.


Persian nouns have no grammatical gender, unlike languages such as Latin. Only specific direct objects take the accusative marker; the other oblique cases are marked by prepositions. Possession is expressed by special markers: if the possessor appears in the sentence after the thing possessed, the ezāfe may be used; otherwise a pronominal genitive enclitic is used. Inanimate nouns pluralize with -hā, while animate nouns generally pluralize with -ān, with some special cases ending in -gān and -yān; -hā is the most common plural. Special rules exist for some nouns borrowed from Arabic.
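The plural rules above can be sketched in a few lines of Python over romanised nouns. This is only an illustration: the function name `pluralize` is hypothetical, the conditions chosen for -gān (after final -e) and -yān (after final -ā or -u) are assumptions of this sketch rather than something stated above, and Arabic loanwords are not handled.

```python
# Toy pluraliser for romanised Persian nouns.
# Assumption (not stated in the text): -gān is used after a final -e and
# -yān after a final -ā or -u. Arabic broken plurals are ignored.


def pluralize(noun: str, animate: bool = False) -> str:
    """Return a plural form of a romanised noun."""
    if not animate:
        return noun + "hā"   # inanimate, and the most common plural
    if noun.endswith("e"):
        return noun + "gān"  # assumed special case
    if noun.endswith(("ā", "u")):
        return noun + "yān"  # assumed special case
    return noun + "ān"       # general animate plural
```

For example, `pluralize("ketāb")` gives `ketābhā`, while `pluralize("zæn", animate=True)` gives `zænān`.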

Natural language processing is a sub-branch of artificial intelligence that converts natural language, as used for communication between people, into an artificial form. Morphology describes which components a word is made of and how these components are put together to create the word. In this paper we first extract the grammatical rules for nouns and adjectives in Persian (Farsi), about 86 and 113 rules respectively, and write their lexicon in the lexc language. We then design a two-way morphological analyser for Persian nouns and adjectives using Xerox Finite State Technology: given an input word (adjective or noun), it breaks the word into its components; given the components with their parts of speech, it generates the adjective or noun.


Adjectives typically follow the nouns they modify, using the ezāfe construction. However, adjectives can precede nouns in compounded derivational forms, such as khosh-bækht (lit. good-luck) 'lucky', and bæd-kār (lit. bad-deed) 'wicked'. Comparative forms ("more ...") make use of the suffix tær (تَر), while the superlative form ("the most ...") uses the suffix tærin (تَرین). Comparatives used attributively follow the nouns they modify, while superlatives precede their nouns. With respect to comparison, "than" is expressed by the preposition "از" (az), for example: سگ من از گربه‌ی تو کوچک‌تر است (Sag-e man az gorbe-ye to kuchektar ast; My dog is smaller than your cat.)
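As a rough illustration of the suffix and word-order rules above, the following Python sketch generates and analyses comparative and superlative forms in romanisation. All function names are hypothetical, the analysis relies on a caller-supplied lexicon of known adjectives (without one, a noun like dokhtær would be wrongly split), and the choice of -ye instead of -e after a vowel is an assumption of this sketch.

```python
VOWELS = "aeiouāæ"


def comparative(adj: str) -> str:
    return adj + "tær"


def superlative(adj: str) -> str:
    return adj + "tærin"


def analyse(word: str, lexicon: set):
    """Return (base, degree) if the word can be analysed, else None."""
    # Try the longer suffix first so that -tærin is not cut short.
    for suffix, degree in (("tærin", "superlative"), ("tær", "comparative")):
        if word.endswith(suffix) and word[: -len(suffix)] in lexicon:
            return word[: -len(suffix)], degree
    return (word, "positive") if word in lexicon else None


def attributive(noun: str, adj: str, degree: str = "positive") -> str:
    """Place the adjective before or after the noun according to degree."""
    if degree == "superlative":
        return superlative(adj) + " " + noun       # superlative precedes
    form = comparative(adj) if degree == "comparative" else adj
    link = "-ye" if noun[-1] in VOWELS else "-e"   # ezāfe (assumed allomorphy)
    return noun + link + " " + form                # others follow the noun
```

With this sketch, `attributive("ketāb", "kuchek", "comparative")` gives `ketāb-e kuchektær`, and `attributive("ketāb", "kuchek", "superlative")` gives `kuchektærin ketāb`.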


Normal verbs can be formed using the following morpheme pattern:

( NEG - DUR or SUBJ/IMPER ) - root - PAST - PERSON - ACC-ENCLITIC

• Negative prefix: næ - changes to ne before the Durative prefix
• Durative prefix: mi
• Subjunctive/Imperative prefix: be
• Past suffix: d - changes to t after unvoiced consonants
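The morpheme pattern above can be sketched as a small Python function over romanised roots. This is a minimal sketch, not a full analyser: `build_verb` and `past_suffix` are hypothetical names, the set of person endings is an assumption (the pattern only names a PERSON slot), and the sketch assumes that the negative prefix replaces be- rather than combining with it. Irregular past stems are not handled.

```python
# Assumed person endings; the third person singular is empty in the past.
PERSON_ENDINGS = {"1sg": "æm", "2sg": "i", "3sg": "æd",
                  "1pl": "im", "2pl": "id", "3pl": "ænd"}
UNVOICED_FINALS = ("p", "t", "k", "f", "s", "sh", "ch", "kh", "h")
VOICED_DIGRAPHS = ("gh", "zh")  # end in "h" but are not unvoiced


def past_suffix(root: str) -> str:
    """Return d, or t after an unvoiced final consonant."""
    if root.endswith(VOICED_DIGRAPHS):
        return "d"
    return "t" if root.endswith(UNVOICED_FINALS) else "d"


def build_verb(root, *, neg=False, dur=False, subj=False,
               past=False, person="3sg", enclitic=""):
    if dur and subj:
        raise ValueError("DUR and SUBJ/IMPER are mutually exclusive")
    prefix = ""
    if neg:
        prefix += "ne" if dur else "næ"  # næ becomes ne before mi
    if dur:
        prefix += "mi"
    elif subj and not neg:               # assumed: negation replaces be-
        prefix += "be"
    stem = root + (past_suffix(root) if past else "")
    ending = "" if (past and person == "3sg") else PERSON_ENDINGS[person]
    return prefix + stem + ending + enclitic
```

For instance, `build_verb("khor", dur=True)` gives `mikhoræd`, and `build_verb("khor", neg=True, dur=True)` gives `nemikhoræd`.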

Farsi verbs are words that convey action (bring, read, walk, run), or a state of being (exist, stand). In most languages a verb may agree with the person, gender, and/or number of some of its arguments, such as its subject, or object.


Farsi articles are words that combine with a noun to indicate the type of reference being made by the noun. Generally articles specify the grammatical definiteness of the noun. Examples are "the, a, and an". Here are some examples:

English       Romanisation          Persian
articles      horuf-e tæ'rif        حروف تعریف
a             -i                    ی
one           yek                   یک
some          bærkhi æz             برخی از
few           mæ'dud                معدود
the book      ketāb                 کتاب
the books     ketāb hā              کتاب ها
a book        ketābi                کتابی
one book      yek ketāb             یک کتاب
some books    bæ'zi æz ketāb hā     بعضی از کتاب ها
few books     chænd tā ketāb        چندتا کتاب


Mood          Tense                     Romanisation              Persian
Indicative    Present                   mikhoræd                  می‌خورد
Indicative    Preterite                 khord                     خورد
Indicative    Imperfective preterite    mikhord                   می‌خورد
Indicative    Perfect                   khordeæst                 خورده‌است
Indicative    Imperfective perfect      mikhordeæst               می‌خورده‌است
Indicative    Pluperfect                khorde bud                خورده بود
Indicative    Imperfective pluperfect   mikhorde bud              می‌خورده بود
Indicative    Future                    khāhæd khord              خواهد خورد
Indicative    Present progressive       dāræd mikhoræd            دارد می‌خورد
Indicative    Preterite progressive     dāsht mikhord             داشت می‌خورد
Subjunctive   Present                   bekhoræd                  بخورد
Subjunctive   Preterite                 khorde bāshæd             خورده باشد
Subjunctive   Imperfective preterite    mikhorde bāshæd           می‌خورده باشد
Subjunctive   Pluperfect                khorde bude bāshæd        خورده بوده باشد
Subjunctive   Imperfective pluperfect   mikhorde bude bāshæd      می‌خورده بوده باشد

Tajik has an enclitic -у, "and", which will probably be written as ZWNJ+و in Perso-Arabic script. The Tajik form can change after a vowel to -ю or -ву, but all of these should be translated as ZWNJ+و or Space+و in Persian.


Persian is a pro-drop language, so personal pronouns (e.g. he, she, I) are optional. Pronouns are generally the same for all cases. Pronominal genitive enclitics are different from normal pronouns. Farsi pronouns include personal pronouns (which refer to the persons speaking, the persons spoken to, or the persons or things spoken about), indefinite pronouns, relative pronouns (which connect parts of sentences) and reciprocal or reflexive pronouns (in which the object of a verb is acted on by the verb's subject).


Tenses in Persian are actually quite simple, and the conjugation charts above already explain many of the rules by themselves. These are the most common tenses:

Infinitive: The infinitive ends in -ن (æn), e.g. خوردن (khōrdæn), 'to eat'. The basic stem of the verb is formed by deleting this ending.

Past: The past tense is formed by deleting the infinitive ending and adding the personal endings to the stem. Unlike English, Persian has virtually no irregular verbs in the past tense. The third person singular takes no ending, so 'خوردن' becomes 'خورد' (khōrd), 'he/she/it ate'.

Perfect: The perfect tense is formed by taking the stem of the verb, adding ه (eh) to the end, and then adding the personal endings. The endings are pronounced separately from the 'ه'. So 'خوردن' in the perfect first person singular is 'خورده ام' (khōrde æm), 'I have eaten'. As in the past tense, the third person singular is irregular: its ending is است. Thus 'خوردن' becomes 'خورده است' (khōrde æst).

Pluperfect: The pluperfect is formed by taking the stem of the perfect, e.g. 'خورده', adding 'بود' (būd), and finally adding the personal endings, thus 'خورده بودم' (khōrde būdæm), 'I had eaten'. In the third person singular, either no ending or است is accepted. 'بود' means 'was', and an interesting philological note is that Latin forms its pluperfect in a similar way: 'eram', 'I was', is added to the perfect stem of a verb such as 'aufugi', 'I escaped', giving 'aufugeram', 'I had escaped'.

Future: The future tense is formed by first taking the present tense of 'خواستن' (khāstæn), 'to want', and conjugating it for the correct person; the third person singular of this verb is 'خواهد' (khāhæd). It is then put in front of the unconjugated past stem of the verb, e.g. 'خورد', thus 'خواهد خورد', 'he/she/it will eat'. For compound verbs such as 'تمیز کردن' (tæmīz kærdæn), 'to clean, refresh', 'خواهد' goes between the two words and 'کردن' is reduced to its stem, thus 'تمیز خواهد کرد' (tæmīz khāhæd kærd), 'he/she/it will clean'. In the negative, 'خواهد' takes the prefix ن.

Present: The present tense is the most difficult tense in Persian because its stems are irregular. It is formed by finding the root (present stem) of the verb, adding the prefix 'می' (mī), and then adding the personal endings. The third person singular ending is -د, which is probably why the past tense has no third person singular ending, since many past stems already end in 'd'. The root of the verb 'خوردن', for example, is 'خور' (khōr), so the present first person singular is 'می خورم' (mī khōræm), 'I eat, am eating, do eat'. The negative prefix ن is pronounced 'ne' before 'mī', but 'næ' in all other tenses.

Persian roots should not be confused with roots in Semitic languages: many are etymologically unrelated to their infinitives, and there is no solid rule that all verbs follow. With some knowledge of Persian verbs, however, one notices a few general patterns shared by similar verbs. For example, in a verb ending in -ختن, such as 'ساختن' (sākhtæn), 'to make, build', the -ختن is replaced with ز, so the root is 'ساز' (sāz).

The present tense construction also has more than one use: it appears in infinitive-like constructions and in imperatives. The English sentence 'I want to eat' translates as 'می خواهم بخورم' (mī khāhæm bekhōræm). 'بخورم' is just another form of the present tense that uses the prefix ب (be) instead of the prefix 'می'. This ب can also form imperatives when attached to the present tense root, so the imperative of 'خوردن' is 'بخور', but could also be 'خورید' or simply 'خور'.
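The formation rules described above can be sketched in Python for verbs written in Persian script. This is an illustrative sketch only: all function names are hypothetical, the full sets of personal endings are assumptions (the text quotes only some of them), words are joined with a plain space as in the examples above rather than with a ZWNJ, and the irregular present root has to be supplied by the caller.

```python
# Person index: 0=1sg, 1=2sg, 2=3sg, 3=1pl, 4=2pl, 5=3pl.
# The complete ending sets below are assumptions of this sketch.
PAST_ENDINGS = ["م", "ی", "", "یم", "ید", "ند"]
PRESENT_ENDINGS = ["م", "ی", "د", "یم", "ید", "ند"]
PERFECT_ENDINGS = ["ام", "ای", "است", "ایم", "اید", "اند"]


def past_stem(infinitive: str) -> str:
    """Delete the infinitive ending ن."""
    if not infinitive.endswith("ن"):
        raise ValueError("expected an infinitive ending in ن")
    return infinitive[:-1]


def past(infinitive: str, person: int) -> str:
    return past_stem(infinitive) + PAST_ENDINGS[person]


def perfect(infinitive: str, person: int) -> str:
    return past_stem(infinitive) + "ه" + " " + PERFECT_ENDINGS[person]


def pluperfect(infinitive: str, person: int) -> str:
    return past_stem(infinitive) + "ه" + " " + "بود" + PAST_ENDINGS[person]


def future(infinitive: str, person: int, preverb: str = "") -> str:
    """For compound verbs, pass the non-verbal part as preverb."""
    words = [preverb, "خواه" + PRESENT_ENDINGS[person], past_stem(infinitive)]
    return " ".join(w for w in words if w)


def present(root: str, person: int) -> str:
    return "می" + " " + root + PRESENT_ENDINGS[person]


def subjunctive(root: str, person: int) -> str:
    return "ب" + root + PRESENT_ENDINGS[person]
```

Calling `future("کردن", 2, preverb="تمیز")` places the auxiliary between the two parts of the compound verb, and `present("خور", 0)` builds the first person singular present from the supplied root.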

Text Corpus

In linguistics, a corpus (plural corpora) or text corpus is a large and structured set of texts (nowadays usually electronically stored and processed). They are used to do statistical analysis and hypothesis testing, checking occurrences or validating linguistic rules on a specific universe. A corpus may contain texts in a single language (monolingual corpus) or text data in multiple languages (multilingual corpus). Multilingual corpora that have been specially formatted for side-by-side comparison are called aligned parallel corpora. In order to make the corpora more useful for doing linguistic research, they are often subjected to a process known as annotation. An example of annotating a corpus is part-of-speech tagging, or POS-tagging, in which information about each word's part of speech (verb, noun, adjective, etc.) is added to the corpus in the form of tags. Another example is indicating the lemma (base) form of each word. When the language of the corpus is not a working language of the researchers who use it, interlinear glossing is used to make the annotation bilingual. Some corpora have further structured levels of analysis applied. In particular, a number of smaller corpora may be fully parsed. Such corpora are usually called Treebanks or Parsed Corpora. The difficulty of ensuring that the entire corpus is completely and consistently annotated means that these corpora are usually smaller, containing around one to three million words. Other levels of linguistic structured analysis are possible, including annotations for morphology, semantics and pragmatics. Corpora are the main knowledge base in corpus linguistics. The analysis and processing of various types of corpora are also the subject of much work in computational linguistics, speech recognition and machine translation, where they are often used to create hidden Markov models for part of speech tagging and other purposes. Corpora and frequency lists derived from them are useful for language teaching. 
Corpora can also be considered a type of foreign-language writing aid: the contextualised grammatical knowledge that non-native language users acquire through exposure to authentic texts in corpora allows learners to grasp how sentences are formed in the target language, enabling effective writing.
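To make the ideas of annotation and frequency lists concrete, here is a minimal Python sketch. The tiny English corpus, its tag names and the function `frequency_list` are all hypothetical and only serve to show how counts can be derived from an annotated corpus of (token, part-of-speech tag, lemma) triples.

```python
from collections import Counter

TOKEN, TAG, LEMMA = 0, 1, 2

# Hypothetical toy corpus: each sentence is a list of
# (token, part-of-speech tag, lemma) triples.
corpus = [
    [("the", "det", "the"), ("dogs", "n", "dog"), ("eat", "v", "eat")],
    [("the", "det", "the"), ("dog", "n", "dog"), ("ate", "v", "eat")],
]


def frequency_list(corpus, field=TOKEN):
    """Count one annotation layer and return (item, count) pairs,
    most frequent first."""
    counts = Counter(item[field] for sentence in corpus for item in sentence)
    return counts.most_common()
```

Counting by `LEMMA` instead of `TOKEN` groups inflected forms such as "dogs" and "dog" under one entry, which is one reason lemma annotation is useful.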


Although the name of the language has been maintained as Persian, Parsi, or its Arabic form Farsi (Arabic has no letter for P), the language has undergone great changes and can be categorised into the following groups:

1. Old Persian
2. Middle Persian
3. Classical Persian
4. Modern Persian

Old Persian is what the original Parsa tribe of the Hakhamaneshian (Achaemenid) era spoke, and they have left us samples carved on stone in cuneiform script.

Middle Persian is the language spoken during the Sasanian era also known as Pahlavi. We have plenty of writings from that era in the form of religious writings of the Zarathushti religion, namely the Bundahish, Arda Viraf nameh, Mainu Khared, Pandnameh Adorbad Mehresfand etc.

The origin of Classical Persian is not very clear. Its words have their roots in different languages spoken in various parts of the country, but the majority have their roots in Old Persian, Pahlavi and Avesta. They are represented in classical writings and poems. Ferdowsi claims to have gone to great pains over a period of thirty years to preserve this language, which was under pressure from the Arab invaders and was on the verge of being lost.

It is noteworthy that every country that the Arabs conquered lost its civilisation, culture and language and adopted the Arabic language and way of life. For example, Egypt, whose people could build pyramids, were good astronomers and possessed the art of mummification, lost its culture and language to the Arabs and started living like them. It was only Iran that broke the trend, stood against the Arabs, preserved its culture and language, and even adopted its own version of Islam, Shi'ism.

Later, when the Mongols invaded Iran, the Iranians converted them into ambassadors of Iranian language, culture and art. Their successors, the Mughals, made Parsi their court language in India.

Modern Persian language or Farsi (Arabic pronunciation of Parsi) as spoken today consists of a lot of words of non-Iranian origin. Some modern technical terms, understandably, have been incorporated from English, French and German and are recognizable, but Arabic has corrupted a major part of the language by replacing original Parsi words.

Persian, also known as Farsi (based on the ancient dialect spoken in the south-western part of the country, in the province of Fars, as it is still known today), belongs to the Indo-European family, Iranian group and is spoken by nearly 33 million people, mainly in Iran. It is one of the world's oldest languages, a standard and well-recognized tongue as early as the 6th century B.C. Old Persian was the language of the great Persian Empire which at one time extended from the Mediterranean to the Indus River in India. The language was written in Cuneiform, the wedge-shaped characters used throughout much of the ancient world. In the 2nd century B.C. the Persians created their own alphabet, known as Pahlavi, which remained in use until the Islamic conquest of the 7th century. Since that time Persian has been written in the Arabic script with a number of additional characters to accommodate special sounds. Literary modern Farsi is virtually identical in Iran and Afghanistan, with very minor lexical differences. As far as dialects are concerned, one of them shades into Dari in Afghanistan, another into Tajik in Tajikistan. English words of Persian origin include shawl, pajama, taffeta, khaki, kiosk, divan, lilac, jasmine, julep, jackal, caravan, bazaar, checkmate, dervish, and satrap.