User:Pmodi/GSOC 2020 proposal: Hindi-Punjabi/progress
The goal of this project was to develop an analyzer for Punjabi along with the translators apertium-hin-pan and apertium-pan-hin. Coverage for Punjabi was calculated on Wikipedia data (all articles from Punjabi Wikipedia, about 3.6 million tokens). While only very simple transfer rules are required for this pair, paradigms and correct analyses of postpositions and multiwords were essential. The tagger was trained as a unigram tagger for this task, and apertium-separable was used alongside it. More work needs to be done on lexical selection rules and a few other tasks (see #Future_Work). The results on Wikipedia articles were decent, except where an article contained too many borrowed words or proper nouns. This will be a problem for the existing hin-urd pair as well. Adding every spelling variation of proper nouns (transliterated to Hindi) is a difficult task, so some alternative needs to be explored. Overall, good coverage was achieved, given that the Punjabi analyzer was picked up from a very nascent state. WER should improve once lexical selection has been done on a corpus.
Most of the code has been merged and can be found here: https://apertium.projectjj.com/gsoc2020/priyankmodiPM.html. Thanks to Tino :)
The major accomplishment of this project has been the Punjabi monodix, which reached 85% coverage (with over 12,000 stems) on a total of 3.6 million tokens (the entire Punjabi Wikipedia corpus). In addition to new entries (added from frequency lists and crossdics), entries broken by tag changes have been fixed, and the dictionary has been reorganized and cleaned. Additions have also been made to the Hindi monodix, with paradigms changed in the cases now handled by separable.
The bidix also has close to 17,500 translations, and pan-hin translator coverage is just above 80%, computed on the same corpus. Because a translation can have multiple spelling options, weights have been assigned to the entries; automatically assigning the correct spelling is left for future work. The hin-pan translator also has decent coverage: ~100% for post/prepositions, pronouns, proper nouns, adverbs, adjectives and intransitive verbs; >80% in the case of adjectives; ~90% in the case of transitive verbs. Most nouns high up or midway on the Zipf curve have been added, but quite a few with lower frequencies remain to be added.
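The coverage figures above are the share of corpus tokens that receive at least one analysis. As a rough illustration, here is a minimal Python sketch that computes this from analyser output in the Apertium stream format, where each lexical unit looks like `^surface/analysis$` and an unknown word is marked with a leading `*`. This is not the script used in the project; the function name and the sample tags are assumptions, and escaped characters in the stream are ignored for brevity.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
# (simplification: escaped "^", "$" and "/" characters are not handled)
LEXICAL_UNIT = re.compile(r"\^([^$]*)\$")


def coverage(analysed_text):
    """Return the fraction of lexical units that have a known analysis."""
    known = 0
    total = 0
    for match in LEXICAL_UNIT.finditer(analysed_text):
        parts = match.group(1).split("/")
        total += 1
        # Unknown words are emitted as ^word/*word$
        if len(parts) > 1 and not parts[1].startswith("*"):
            known += 1
    return known / total if total else 0.0


if __name__ == "__main__":
    # Illustrative input only; the tags are made up for the example.
    sample = "^ਘਰ/ਘਰ<n><m><sg>$ ^ਵਿੱਚ/ਵਿੱਚ<post>$ ^foo/*foo$ ^bar/*bar$"
    print(f"coverage: {coverage(sample):.2%}")
```

On a real corpus, the input would be the output of the morphological analyser run over the tokenised Wikipedia text.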
The pair still needs to go through testvoc iterations, because some verbs that were in the bidix before the project still remain. Most of these have been removed, and proper nouns, adverbs, adjectives and almost all nouns have been checked.
Thanks to the Lsx module, Apertium can efficiently handle multiwords, even in cases where one of the words is a suffix of some other word. Once IRSTLM is fixed, more examples that could benefit from the use of separable can be analyzed. @tanmai_khanna and @spectie were a huge help in guiding me through this part of the project :)
The Hindi and Punjabi dictionaries were originally using taggers (.prob files) copied directly from the English module. @spectie helped me train a newer version and make the taggers unigram.
Hindi and Punjabi are fairly similar when it comes to syntax, so rules from the existing Urdu-Hindi module were mostly sufficient for this pair as well.
The translator performs well (tested on Wikipedia articles and stories taken from Indic websites) as long as the text does not contain many borrowed words or proper nouns with spelling variations. This remains a challenge, especially for technical Wikipedia articles. On a generic Wikipedia article about the state of Pakistan, the translator achieved a WER of 10.84%.
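For reference, WER (word error rate) is usually defined as the word-level edit distance between the machine translation and a reference translation, divided by the number of words in the reference. The sketch below shows that standard definition in Python; it is only an illustration and not necessarily the tool used to produce the figures on this page. The function name is an assumption.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] = edit distance between the first i-1 reference words
    # and the first j hypothesis words
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of four reference words
    print(word_error_rate("a b c d", "a x c d"))  # 0.25
```

Because the distance is divided by the reference length, a hypothesis with many extra words can give a WER above 100%.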
|1||May 1-24||-||-||-||-||47%||-||-||-||-||Original coverage - 12.8%|
|2||May 24-31||-||+1400||-||-||52%||-||36.83%, 33.84%||40.03%, 36.93%||-||WER, PER On a set of 25 sentences(612 words)|
|3||June 1-7||-||+500||+250||-||58%||-||49.52%, 42.71%||48.65%, 44.03%||-||WER, PER On a set of 50 sentences(1038 words)|
|4||June 8-14||+50||+1500||+1000||-||63%||-||41.36%, 35.82%||41.47%, 37.03%||-||WER, PER On a set of 50 sentences(1038 words)|
|5||June 15-21||-||+50||+200||-||66%||52.8%||41.36%, 35.82%||41.47%, 37.03%||-||WER, PER On a set of 50 sentences(1038 words)|
|6||June 22-28||-||+350||+700||-||70%||59%||39.76%, 34.22%||40.03%, 36.23%||-||WER, PER On a set of 50 sentences(1038 words)|
|7||June 29-Jul 6||-||+350||+700||-||71.3%||61%||39.76%, 34.22%||40.03%, 36.23%||-||WER, PER On a set of 50 sentences(1038 words)|
|8||July 7-13||-||+1200||+1600||-||73%||63%||39.76%, 34.22%||40.03%, 36.23%||-||WER, PER On a set of 50 sentences(1038 words)|
|9||July 14-20||-||+2000||+2000||-||74%||65%||36.76%, 32.22%||39.03%, 33.53%||-||WER, PER On a set of 100 sentences(2140 words)|
|10||July 21-27||-||+2500||+2200||-||75%||67%||33.76%, 29.62%||35.03%, 31.23%||-||WER, PER On a set of 100 sentences(2140 words)|
|11||July 28-Aug 2||-||+1200||+1300||-||78%||70%||32.34%, 28.82%||34.45%, 30.43%||-||WER, PER On a set of 100 sentences(2140 words)|
|12||Aug 3-9||-||+1800||+1100||-||81%||70%||28.04%, 25.85%||30.15%, 29.73%||-||WER, PER On a set of 110 sentences(2140 words)|
|13||Aug 10-16||+200||+300||+/-/*5000||-||83%||80%||22.04%, 19.75%||26.15%, 23.13%||-||WER, PER On a set of 110 sentences(2140 words)|
|14||Aug 17-24||+100||+600||+300||-||85%||81%||8.4%, 6.78%||26.44%, 22.31%||-||WER, PER On a set of 180 sentences(3200 words)|
- DONE - Added WX transliterations to bidix.
- DONE - pan-hin translator coverage up to 80%.
- DONE - Pan monodix coverage up to 85%
- DONE - WER below 20% for hin-pan
- DONE - WER below 25% for pan-hin
- DONE - Making tagger unigram(@spectie).
- DONE - Adding .lsx files and modes for multiwords (mostly with preposition).
- DONE - Added bidix coverage script
- DONE - Added Verbs.
- DONE - Bidix updates.
- DONE - Fixed errors with postposition transfer.
- DONE - Added proper nouns.
- DONE - Added noun paradigms.
- DONE - Added Adverbs.
- DONE - Added about 1400 adjective stems
- DONE - Function words(cnj, det, prn, post, gen_endings), Coverage > 47%
- DONE - Collected parallel texts to calculate WER, PER, etc.
- DONE - Added bidirectional dictionary (33k paradigms)
- DONE - Fixed bidirectional translation, i.e. pan->hin (gave close to human translation for a small test set, even though similar transfer rules were copied)
- DONE - Scraped all Wikipedia texts and made a combined frequency list.
- DONE - Frequency lists using WikiExtractor on latest dump.
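The frequency lists mentioned in the list above can be approximated with a few lines of Python once the dump has been extracted to plain text. This is a minimal sketch under simple assumptions: tokens are split on whitespace, and only ASCII punctuation plus the danda and double danda are stripped. The function name is hypothetical and this is not the script used in the project.

```python
import string
from collections import Counter

# ASCII punctuation plus danda (U+0964) and double danda (U+0965)
STRIP_CHARS = string.punctuation + "\u0964\u0965"


def frequency_list(lines):
    """Count whitespace-separated tokens, most frequent first."""
    counts = Counter()
    for line in lines:
        for token in line.split():
            token = token.strip(STRIP_CHARS)
            if token:
                counts[token] += 1
    return counts.most_common()


if __name__ == "__main__":
    for word, count in frequency_list(["a b a.", "b a\u0964"]):
        print(count, word)
```

Whitespace splitting is used instead of a `\w+` regex because Python's `\w` does not match Gurmukhi and Devanagari vowel signs, which would split words in the middle.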
(Project Ended : Check #Future_Work)
- Conjunct verbs in the Hindi monodix exist in the form of multiwords, which is (1) unnecessary and (2) not a great way of implementing them. Most of these are handled even when translating to Punjabi, because the first verb in the conjunct takes its form from an already existing verb without the second verb (that is, if it can exist alone). However, this can cause problems in the chunking stage, where the conjunct verb is read as two verbs (in the case of Punjabi).
Solution: use separable.
- Multiple paradigms, mostly in the <prn> category, use "mf" in place of m-obj. This needs to be fixed in multiple places in the Hindi and Urdu dictionaries. For example, the number of the subject/object matters for pronouns: ਮੇਰੀਆਂ | ਸਾਡੀ. मेरे (marked mf) | मेरा (marked m).
- The tagger needs to be trained to prefer verbs and post/prepositions over adjectives and adverbs (this is not a regular trend, and results are mostly decent).
- In some colloquial varieties, postpositions are combined with words, and these forms need to be identified. For example, postpositions could be added to the definitions: ਹਸਪਤਾਲੋਂ = ਹਸਪਤਾਲ ਤੋਂ. It is a very irregular trend, though, and is not usually used in written Punjabi.
- What is the analysis for ਦੋਸਤੋਂ (as in "Friends, come with me")? The problem is that doston in Hindi is the translation both for this and for "friends" as used in "his friends did this". Not a big problem, but analysis is needed to identify more such occurrences.
- ਮੈਂ is the translation for both मैंने and मैं. Add a transfer rule and check whether the tagger correctly identifies which one should be picked.
- Fix the analyses of alternate spellings. Multiple spellings have mostly been added by hand, but some easy automatic fixes may be possible. For example: ਉਸ੍ਸਰ > ਉੱਸਰ, ਖਿਲ੍ਲਰ > ਖਿੱਲਰ
- Fix borrowed words (mostly English) and proper nouns. Transliteration can be worked on once these named entities can be correctly identified.
- Lexical selection rules need to be added. This could not be completed due to some errors in IRSTLM. For example, a simple rule: aisa <noun> - ajiha, but aisa <verb> - innj.
- Meeting testvoc requirements before release.
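The automatic spelling fix suggested in the list above (ਉਸ੍ਸਰ > ਉੱਸਰ, ਖਿਲ੍ਲਰ > ਖਿੱਲਰ) could be sketched as a single substitution: a consonant followed by a virama and the same consonant is rewritten as an addak followed by that consonant. This rule is an assumption generalised from the two examples given; it has not been checked against a corpus and may over-apply. The function name and the consonant range used are also assumptions.

```python
import re

VIRAMA = "\u0A4D"  # Gurmukhi sign virama
ADDAK = "\u0A71"   # Gurmukhi addak (marks gemination of the next consonant)

# Assumption: the basic Gurmukhi consonant block, U+0A15 (ka) to U+0A39 (ha)
GEMINATE = re.compile("([\u0A15-\u0A39])" + VIRAMA + "\\1")


def normalise_gemination(word):
    """Rewrite consonant + virama + same consonant as addak + consonant."""
    return GEMINATE.sub(ADDAK + "\\1", word)


if __name__ == "__main__":
    for variant in ("\u0A09\u0A38\u0A4D\u0A38\u0A30", "\u0A16\u0A3F\u0A32\u0A4D\u0A32\u0A30"):
        print(variant, ">", normalise_gemination(variant))
```

A rule like this could be applied to input before analysis, or used to generate candidate spellings to review before adding them to the monodix.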
Experience and Final words
Working with Apertium over the past three months has been nothing less than amazing. I don't think I ever had a problem understanding the pipeline (the limited parts of it I used), thanks to the extensive documentation that exists. The community is probably the most helpful one I've ever been part of. I would love to see this pair released soon, after, of course, a good analysis of my work.
Literature (Apertium Wiki) Covered
- DONE - Calculating Coverage.
- DONE - A long introduction on Transfer Rules.
- DONE - Transfer Rules examples
- DONE - Wikipedia Dumps.
- DONE - Generating Frequency lists.
- DONE - Building Dictionaries#Monolingual.
- DONE - Evaluation.
- DONE - Extract.
- DONE - Monodix Basics
- DONE - Improved Corpus Based Paradigm Matching.
- DONE - Transliteration.
- DONE - Workflow reference.
- DONE - Tagger Training.
- DONE - Modes introduction.
- DONE - Apertium-viewer.