Difference between revisions of "Hindi and Bengali"

From Apertium
 
(One intermediate revision by the same user not shown)
Line 1: Line 1:


=Hindi and Bengali for GSoC=
=Hindi and Bengali for GSoC '21=

This is a language pair translating between [[Hindi]] and [[Bengali]]. The project developed the Hindi-Bengali pair in both directions, i.e. ben-hin and hin-ben. Two dictionaries were built on top of an existing open-source project in which very little work had been done: the Bengali monolingual dictionary and the Bengali-Hindi bilingual dictionary. Although this was not anticipated, several errors were found in the Hindi paradigms, so the Hindi monolingual dictionary was modified as well. The Bengali dictionary was also restructured to match the Hindi dictionary.

The work was spread over 11 weeks. Work on the Bengali monolingual and Bengali-Hindi bilingual dictionaries began with the closed categories; further words were then added in order of frequency. Work on the Hindi dictionary began after the 6th week: several paradigms were corrected (missing tags were added, and other tags were removed or reordered) and several paradigms were marked as deprecated.

==Current Status==

* Currently there are 7078 words excluding proper names in the monolingual dictionary and 1718 words excluding proper names in the bilingual dictionary.
* Current coverage of Hin-Ben translator is ~67.8% and Ben-Hin translator is ~49.7%.
* The Bengali monolingual dictionary coverage is ~72.0%.
* [https://wiki.apertium.org/wiki/User:Gourab337/GSoC2021-Workplan-Control Workplan for GSoC '21 Ben-Hin]

==Goals==

Currently the translator is very basic. We need to increase its coverage so that it handles more of the vocabulary of both languages. We also need to add more transfer rules to cover all the [https://wiki.apertium.org/wiki/Hindi_and_Bengali/Pending-Tests Pending Tests] and so produce more accurate translations.


==Done==
==Done==
* Closed Categories (n, adj, vblex, vbser, adv, prn, post, cnjcoo, cnjsub, cnjadv, det, num, ord).
* <s>Closed Categories (n, adj, vblex, vbser, adv, prn, post, cnjcoo, cnjsub, cnjadv, det, num, ord).</s>
* nouns, post, adj, adv, det from hitparade list.
* <s>Most frequently used nouns, post, adj, adv, det added.</s>
* Hin > Ben transfer rules on nouns, verb tenses and adj.
* <s>Hin > Ben transfer rules on nouns, verb tenses and adj added.</s>
* Testing scripts and test corpus.
* <s>Testing scripts and test corpus.</s>


==Todo list==
==Todo list==
* Add more words for nouns, adjectives and verbs from hitparade list.
* Increase coverage of translator by adding more nouns, adjectives and verbs from the list of most frequently used words in corpus. [https://wiki.apertium.org/wiki/Building_dictionaries Reference]
* Add transfer rules to fix pronoun #s (obj -> obl, nom -> nom, erg conversion)
* Add transfer rules to fix pronoun #s (obj -> obl, nom -> nom, erg conversion).
* Transfer rules for [https://wiki.apertium.org/wiki/Hindi_and_Bengali/Pending-Tests Pending Tests for Apertium-ben-hin] (Ben > Hin and Hin > Ben).
* Write transfer rules for [https://wiki.apertium.org/wiki/Hindi_and_Bengali/Pending-Tests Pending Tests] (Ben > Hin and Hin > Ben).
* Add more rules in the pending tests.
* Lift prox and dist tag via making a suitable paradigm for det (ইটা / ওটা)
* Remove prox and dist tag in the bidix and replace it by making suitable paradigms for det.prox & det.dist (ইটা / ওটা).
* Work on lexical selection.
* Morphological disambiguation of Hindi sentences for Hindi-Bengali translation.


==Apertium Git Repositories==
==Apertium Git Repositories==
Line 18: Line 36:
*[https://github.com/apertium/apertium-hin apertium-hin]
*[https://github.com/apertium/apertium-hin apertium-hin]
*[https://github.com/apertium/apertium-ben apertium-ben]
*[https://github.com/apertium/apertium-ben apertium-ben]
*[https://github.com/apertium/apertium-eng-hin apertium-eng-hin]


==External Resources==
==External Resources==
Line 49: Line 66:
* [[Bengali]]
* [[Bengali]]
* [[Hindi]]
* [[Hindi]]
* [[Hindi and English]]


[[Category:Hindi and Bengali]]
[[Category:Hindi and Bengali]]

Latest revision as of 05:45, 25 August 2021

Hindi and Bengali for GSoC '21

This is a language pair translating between Hindi and Bengali. The project developed the Hindi-Bengali pair in both directions, i.e. ben-hin and hin-ben. Two dictionaries were built on top of an existing open-source project in which very little work had been done: the Bengali monolingual dictionary and the Bengali-Hindi bilingual dictionary. Although this was not anticipated, several errors were found in the Hindi paradigms, so the Hindi monolingual dictionary was modified as well. The Bengali dictionary was also restructured to match the Hindi dictionary.

The work was spread over 11 weeks. Work on the Bengali monolingual and Bengali-Hindi bilingual dictionaries began with the closed categories; further words were then added in order of frequency. Work on the Hindi dictionary began after the 6th week: several paradigms were corrected (missing tags were added, and other tags were removed or reordered) and several paradigms were marked as deprecated.
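Adding words in order of frequency presupposes a frequency list ("hitparade") built from a corpus. The following is a minimal sketch of one way to produce such a list in Python; it is not the script used in the project. The function name `hitparade` and the punctuation set are assumptions made for illustration. Tokens are split on whitespace rather than with `\w+`, because Python's `\w` does not match the combining vowel signs used in Bengali and Devanagari.

```python
import string
from collections import Counter

# Characters stripped from the edges of each token. The danda and double
# danda are added for Hindi/Bengali text. This set is an assumption and
# may need extending for a real corpus.
PUNCTUATION = string.punctuation + "\u0964\u0965"


def hitparade(text, top=None):
    """Return (word, count) pairs sorted by descending frequency."""
    counts = Counter()
    for raw in text.split():
        token = raw.strip(PUNCTUATION)
        if token:
            counts[token] += 1
    # most_common(None) returns every entry, most frequent first.
    return counts.most_common(top)


if __name__ == "__main__":
    for word, count in hitparade("ওটা বই। ওটা কলম।"):
        print(count, word)
```

Words near the top of the list that the dictionary does not yet analyse are the natural candidates to add first.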

Current Status

  • Currently there are 7078 words excluding proper names in the monolingual dictionary and 1718 words excluding proper names in the bilingual dictionary.
  • Current coverage of Hin-Ben translator is ~67.8% and Ben-Hin translator is ~49.7%.
  • The Bengali monolingual dictionary coverage is ~72.0%.
  • Workplan for GSoC '21 Ben-Hin
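The page does not say how the coverage figures were measured. A common naive definition is the share of tokens that receive at least one analysis. The sketch below assumes analyser output in the Apertium stream format, where each lexical unit is written `^surface/analysis$` and an unknown word has an analysis beginning with `*`. The function name `coverage` and the sample analyses are illustrative only, and escaped `^`, `$` and `/` characters are not handled.

```python
import re

# One lexical unit: everything between an unescaped-looking ^ and the next $.
# Simplification: escaped delimiters inside a unit are not supported.
LEXICAL_UNIT = re.compile(r"\^([^$]*)\$")


def coverage(analysed):
    """Return the fraction of lexical units that have a known analysis."""
    units = LEXICAL_UNIT.findall(analysed)
    if not units:
        return 0.0
    known = 0
    for unit in units:
        # Split the surface form from its analyses at the first slash.
        _surface, _sep, analyses = unit.partition("/")
        # Unknown words are marked with a leading asterisk.
        if not analyses.startswith("*"):
            known += 1
    return known / len(units)


if __name__ == "__main__":
    # Illustrative input: one analysed word and one unknown word.
    sample = "^বই/বই<n>$ ^foo/*foo$"
    print(f"{coverage(sample):.1%}")
```

Run over a large analysed corpus, the same ratio gives a rough figure comparable in kind to the percentages listed above, though the exact tokenisation and corpus behind those numbers are not stated here.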

Goals

Currently the translator is very basic. We need to increase its coverage so that it handles more of the vocabulary of both languages. We also need to add more transfer rules to cover all the Pending Tests and so produce more accurate translations.

Done

  • Closed Categories (n, adj, vblex, vbser, adv, prn, post, cnjcoo, cnjsub, cnjadv, det, num, ord).
  • Most frequently used nouns, post, adj, adv, det added.
  • Hin > Ben transfer rules on nouns, verb tenses and adj added.
  • Testing scripts and test corpus.

Todo list

  • Increase coverage of translator by adding more nouns, adjectives and verbs from the list of most frequently used words in corpus. Reference
  • Add transfer rules to fix pronoun #s (obj -> obl, nom -> nom, erg conversion).
  • Write transfer rules for Pending Tests (Ben > Hin and Hin > Ben).
  • Add more rules in the pending tests.
  • Remove prox and dist tag in the bidix and replace it by making suitable paradigms for det.prox & det.dist (ইটা / ওটা).
  • Work on lexical selection.
  • Morphological disambiguation of Hindi sentences for Hindi-Bengali translation.

Apertium Git Repositories

External Resources

General

Dictionaries

Corpora


See also