Hindi and Bengali


Hindi and Bengali for GSoC '21

This is a language pair translating between Hindi and Bengali. The GSoC '21 project developed the pair in both directions, i.e. ben-hin and hin-ben. Two dictionaries were built on top of an existing open-source project in which very little work had been done: the Bengali monolingual dictionary and the Bengali-Hindi bilingual dictionary. Although this was not anticipated, several errors were found in the Hindi paradigms, so the Hindi monolingual dictionary was modified as well. The Bengali dictionary was also restructured to match the Hindi dictionary.

The work was spread over 11 weeks. Work on the Bengali monolingual and Bengali-Hindi bilingual dictionaries began with the closed categories; further words were then added in order of frequency. Work on the Hindi dictionary began after the 6th week, and several paradigms were corrected (tags were added where missing, removed or reordered, and several paradigms were marked as deprecated).

Current Status

  • The monolingual dictionary currently contains 7078 words, excluding proper names, and the bilingual dictionary contains 1718 words, excluding proper names.
  • Current coverage of the Hin-Ben translator is ~67.8% and of the Ben-Hin translator ~49.7%.
  • The Bengali monolingual dictionary coverage is ~72.0%.
  • Workplan for GSoC '21 Ben-Hin
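The coverage figures above are usually understood as the share of corpus tokens that receive at least one analysis. The following is a minimal sketch of that calculation, not the script used for this pair. It assumes the input is analyser output in the Apertium stream format, where each lexical unit looks like `^surface/analysis$` and an unknown word is echoed back with a leading asterisk (`^word/*word$`). The function name `coverage` and the sample tags are made up for illustration, and escaped characters inside lexical units are ignored for simplicity.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
# Simplification: escaped characters such as \/ are not handled.
LEXICAL_UNIT = re.compile(r"\^([^/^$]*)((?:/[^/^$]*)+)\$")


def coverage(analysed_text: str) -> float:
    """Return the fraction of lexical units that have a known analysis."""
    total = 0
    known = 0
    for match in LEXICAL_UNIT.finditer(analysed_text):
        total += 1
        # group(2) is "/analysis1/analysis2"; take the first analysis.
        first_analysis = match.group(2)[1:].split("/")[0]
        # Unknown words are marked with a leading asterisk.
        if not first_analysis.startswith("*"):
            known += 1
    return known / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical analyser output: one known word, one unknown word.
    sample = "^আমি/আমি<prn>$ ^xyz/*xyz$"
    print(f"coverage: {coverage(sample):.1%}")
```

Running the analyser over a corpus and passing its output to `coverage` would give a naive token-level figure comparable in spirit to the percentages listed above.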

Goals

Currently the translator is very basic. We need to increase its coverage so that it handles more of the vocabulary of both languages. We also need to add more transfer rules, so that all the Pending Tests pass and translations become more accurate.

Done

  • Closed Categories (n, adj, vblex, vbser, adv, prn, post, cnjcoo, cnjsub, cnjadv, det, num, ord).
  • Most frequently used nouns, postpositions, adjectives, adverbs and determiners added.
  • Hin > Ben transfer rules for nouns, verb tenses and adjectives added.
  • Testing scripts and test corpus.

Todo list

  • Increase the coverage of the translator by adding more nouns, adjectives and verbs from the list of the most frequently used words in the corpus. Reference
  • Add transfer rules to fix pronoun #s (obj -> obl , nom -> nom, erg conversion).
  • Write transfer rules for Pending Tests (Ben > Hin and Hin > Ben).
  • Add more rules in the pending tests.
  • Remove the prox and dist tags in the bidix and replace them with suitable paradigms for det.prox & det.dist (ইটা / ওটা).
  • Work on lexical selection.
  • Morphological disambiguation of Hindi sentences for Hindi-Bengali translation.
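The first item above relies on a list of frequent words that the dictionaries do not yet cover. One way to build such a list is sketched below, under the same assumption as before: analyser output in the Apertium stream format, where unknown words appear as `^word/*word$`. The helper name `unknown_word_frequencies` is hypothetical and this is not the script used in the project.

```python
import re
from collections import Counter

# Lexical units whose analysis starts with an asterisk are unknown words.
# Simplification: escaped characters inside lexical units are not handled.
UNKNOWN_UNIT = re.compile(r"\^([^/^$]*)/\*[^/^$]*\$")


def unknown_word_frequencies(analysed_text: str, top_n: int = 10):
    """Return the top_n most frequent unknown surface forms with counts."""
    counts = Counter(m.group(1) for m in UNKNOWN_UNIT.finditer(analysed_text))
    return counts.most_common(top_n)


if __name__ == "__main__":
    # Hypothetical analyser output with two distinct unknown words.
    sample = "^foo/*foo$ ^bar/bar<n>$ ^foo/*foo$ ^baz/*baz$"
    for word, count in unknown_word_frequencies(sample):
        print(count, word)
```

Words at the top of such a list are the ones whose addition to the monolingual and bilingual dictionaries would raise coverage the most.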

Apertium Git Repositories

External Resources

General

Dictionaries

Corpora

See also