User:Pmodi/GSOC 2020 proposal: Hindi-Punjabi/progress
Revision as of 20:43, 29 August 2020
Summary
The goal of this project was to develop an analyzer for Punjabi along with the translators apertium-hin-pan and apertium-pan-hin. Coverage for Punjabi was calculated on Wikipedia data (all articles from the Punjabi Wikipedia, covering about 3.6 million tokens). While only very simple transfer rules are required for this pair, paradigms and the correct analyses of postpositions and multiwords were essential. A unigram tagger was trained for this task, and apertium-separable was used. More work needs to be done on lexical selection rules and a few other tasks (see #Future_Work). The results on Wikipedia articles were fairly good, except where an article contained many borrowed words or proper nouns; this will be a problem for the existing hin-urd pair as well. Adding the many spelling variations of proper nouns (transliterated into Hindi) is a difficult task, so some other alternative needs to be explored. Overall, good coverage was achieved (given that the Punjabi analyzer was picked up from a very nascent state), and WER should improve once lexical selection is done on a corpus.
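Coverage here means the share of corpus tokens the analyzer can assign at least one analysis. A minimal sketch of how such a figure can be computed, assuming analyzer output in which unknown tokens carry Apertium's `*` prefix (the token list below is hypothetical, not from the actual evaluation):

```python
def coverage(analyzed_tokens):
    """Fraction of tokens the analyzer recognized (no leading '*')."""
    total = len(analyzed_tokens)
    if total == 0:
        return 0.0
    known = sum(1 for t in analyzed_tokens if not t.startswith("*"))
    return known / total

# Hypothetical analyzer output: 3 of 4 tokens analyzed.
sample = ["ghar", "*Wikipedia", "vich", "hai"]
print(coverage(sample))  # -> 0.75
```

The real numbers in the table below were computed over the full scraped Wikipedia corpus, not a toy list like this.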
Repos
- https://github.com/apertium/apertium-pan
- https://github.com/apertium/apertium-hin-pan
- https://github.com/apertium/apertium-hin
Current tasks
(Project Ended)
Progress
Progress table
№ | Dates | Stems (hin) | Stems (pan) | Stems (pan-hin bidix) | Coverage (hin) | Coverage (pan) | Coverage (pan-hin) | WER, PER (hin→pan) | WER, PER (pan→hin) | Evaluation | Notes
---|---|---|---|---|---|---|---|---|---|---|---
1 | May 1-24 | - | - | - | - | 47% | - | - | - | - | Original coverage - 12.8% |
2 | May 24-31 | - | +1400 | - | - | 52% | - | 36.83%, 33.84% | 40.03%, 36.93% | - | WER, PER On a set of 25 sentences(612 words) |
3 | June 1-7 | - | +500 | +250 | - | 58% | - | 49.52%, 42.71% | 48.65%, 44.03% | - | WER, PER On a set of 50 sentences(1038 words) |
4 | June 8-14 | +50 | +1500 | +1000 | - | 63% | - | 41.36%, 35.82% | 41.47%, 37.03% | - | WER, PER On a set of 50 sentences(1038 words) |
5 | June 15-21 | - | +50 | +200 | - | 66% | 52.8% | 41.36%, 35.82% | 41.47%, 37.03% | - | WER, PER On a set of 50 sentences(1038 words) |
6 | June 22-28 | - | +350 | +700 | - | 70% | 59% | 39.76%, 34.22% | 40.03%, 36.23% | - | WER, PER On a set of 50 sentences(1038 words) |
7 | June 29-Jul 6 | - | +350 | +700 | - | 71.3% | 61% | 39.76%, 34.22% | 40.03%, 36.23% | - | WER, PER On a set of 50 sentences(1038 words) |
8 | July 7-13 | - | +1200 | +1600 | - | 73% | 63% | 39.76%, 34.22% | 40.03%, 36.23% | - | WER, PER On a set of 50 sentences(1038 words) |
9 | July 14-20 | - | +2000 | +2000 | - | 74% | 65% | 36.76%, 32.22% | 39.03%, 33.53% | - | WER, PER On a set of 100 sentences(2140 words) |
10 | July 21-27 | - | +2500 | +2200 | - | 75% | 67% | 33.76%, 29.62% | 35.03%, 31.23% | - | WER, PER On a set of 100 sentences(2140 words) |
11 | July 28-Aug 2 | - | +1200 | +1300 | - | 78% | 70% | 32.34%, 28.82% | 34.45%, 30.43% | - | WER, PER On a set of 100 sentences(2140 words) |
12 | Aug 3-9 | - | +1800 | +1100 | - | 81% | 70% | 28.04%, 25.85% | 30.15%, 29.73% | - | WER, PER On a set of 110 sentences(2140 words) |
13 | Aug 10-16 | +200 | +300 | +/-/*5000 | - | 83% | 80% | 22.04%, 19.75% | 26.15%, 23.13% | - | WER, PER On a set of 110 sentences(2140 words) |
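The WER and PER columns above were computed on small parallel test sets. A rough sketch of the two metrics, assuming the usual word-level definitions (WER as word-level Levenshtein distance divided by reference length; PER as a position-independent, bag-of-words variant); the exact formulation used for the table is not recorded here:

```python
from collections import Counter

def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance / reference length."""
    r, h = reference.split(), hypothesis.split()
    # Standard dynamic-programming Levenshtein distance over words.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(r)][len(h)] / len(r)

def per(reference, hypothesis):
    """Position-independent Error Rate: like WER but ignoring word order
    (one common bag-of-words formulation)."""
    r, h = Counter(reference.split()), Counter(hypothesis.split())
    matches = sum((r & h).values())
    return 1 - matches / sum(r.values())
```

Note that `per` as written ignores extra hypothesis words beyond the reference length; reordered but correct words count as matches, which is why PER is always at or below WER in the table.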
DONE
- DONE - Added WX transliterations to bidix.
- DONE - pan-hin translator coverage up to 80%.
- DONE - Pan monodix coverage up to 85%
- DONE - WER below 20% for hin-pan
- DONE - WER below 25% for pan-hin
- DONE - Made the tagger unigram (@spectie).
- DONE - Added .lsx files and modes for multiwords (mostly with postpositions).
- DONE - Added bidix coverage script
- DONE - Added Verbs.
- DONE - Bidix updates.
- DONE - Fixed errors with postposition transfer.
- DONE - Added proper nouns.
- DONE - Added noun paradigms.
- DONE - Added Adverbs.
- DONE - Added about 1400 adjective stems
- DONE - Function words (cnj, det, prn, post, gen_endings); coverage > 47%.
- DONE - Collected parallel texts to calculate WER, PER, etc.
- DONE - Added bidirectional dictionary (33k paradigms).
- DONE - Fixed bidirectional translation, i.e. pan->hin (gave close-to-human translation for a small test set, even though similar transfer rules were copied).
- DONE - Scraped all Wikipedia texts and made a combined frequency list.
- DONE - Frequency lists using WikiExtractor on latest dump.
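The frequency-list step above can be sketched as follows (a toy whitespace tokenizer over a hypothetical mini-corpus; the real pipeline ran over WikiExtractor output from the full Punjabi Wikipedia dump):

```python
from collections import Counter

def frequency_list(text):
    """Whitespace-tokenize and count word forms, most frequent first."""
    return Counter(text.split()).most_common()

# Hypothetical stand-in for extracted Wikipedia text:
corpus = "ਘਰ ਵਿੱਚ ਘਰ ਦੇ ਲੋਕ"
for word, count in frequency_list(corpus):
    print(f"{count}\t{word}")
```

Working through such a list from the top is what drove the week-by-week coverage gains in the progress table: each batch of added stems targets the most frequent still-unanalyzed forms.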
IN PROGRESS
(Project Ended : Check #Future_Work)
TODO
(Project Ended : Check #Future_Work)
Future Work
- Conjunct verbs in the Hindi monodix exist as multiwords, which is (1) not necessary and (2) not a great way of implementing them. Most of these are still handled when translating to Punjabi, because the first verb in the conjunct takes its form from an already existing verb entry (that is, if it can exist alone). However, this can cause problems in the chunking stage, where the conjunct verb is read as two verbs (in the case of Punjabi).
Solution: use apertium-separable.
- Multiple paradigms, mostly in the <prn> category, use "mf" in place of m-obj. This needs to be fixed in multiple places in the Hindi and Urdu dictionaries. For example, the number of the subject/object matters for pronouns: ਮੇਰੀਆਂ | ਸਾਡੀ; मेरे (marked mf) | मेरा (marked m).
- The tagger needs to be trained to prefer verbs and postpositions/prepositions over adjectives and adverbs (this is not a regular trend, and results are mostly decent).
- In some colloquial registers, postpositions are fused with words. These need to be identified and handled. For example, add postpositions to definitions: ਹਸਪਤਾਲੋਂ = ਹਸਪਤਾਲ ਤੋਂ.
- What is the analysis for ਦੋਸਤੋਂ ("Friends, come with me")? The problem is that doston in Hindi translates both this vocative use and friends as in "his friends did this". Not a big problem, but it needs analysis to identify more such occurrences.
- Add a transfer rule and check whether the tagger correctly identifies which one is being picked: ਮੈਂ is the translation for both मैंने and मैं.
- Fix analyses of alternate spellings. Multiple spellings have mostly been added, but some easy automatic fixes could still be applied. For example: ਉਸ੍ਸਰ > ਉੱਸਰ, ਖਿਲ੍ਲਰ > ਖਿੱਲਰ.
- कि vs. की: the Punjabi ki can also be kya.
- Fix borrowed words (mostly English) and proper nouns. Transliteration can be worked on once these named entities are correctly identified.
- Lexical selection rules need to be added; this could not be completed due to errors in IRSTLM. For example, a simple rule: aisa <noun> → ajiha, but aisa <verb> → innj.
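The alternate-spelling pairs listed above (ਉਸ੍ਸਰ > ਉੱਸਰ, ਖਿਲ੍ਲਰ > ਖਿੱਲਰ) follow one mechanical pattern: a consonant doubled via the virama is rewritten with the addak. A sketch of the kind of automatic fix suggested there (function name hypothetical; real data may need more patterns than this one):

```python
import re

ADDAK = "\u0A71"  # Gurmukhi addak (gemination sign)

def normalize_gemination(word):
    """Rewrite consonant + virama (U+0A4D) + same consonant as
    addak + consonant, e.g. ਉਸ੍ਸਰ -> ਉੱਸਰ, ਖਿਲ੍ਲਰ -> ਖਿੱਲਰ."""
    return re.sub(r"(.)\u0A4D\1", ADDAK + r"\1", word)

print(normalize_gemination("ਖਿਲ੍ਲਰ"))  # -> ਖਿੱਲਰ
```

Running such a normalizer over a frequency list would surface variant spellings to merge, instead of adding each variant to the monodix by hand.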
Literature (Apertium Wiki) Covered
- DONE - Calculating Coverage.
- DONE - A long introduction on Transfer Rules.
- DONE - Transfer Rules examples
- DONE - Wikipedia Dumps.
- DONE - Generating Frequency lists.
- DONE - Building Dictionaries#Monolingual.
- DONE - Evaluation.
- DONE - Extract.
- DONE - Monodix Basics
- DONE - Improved Corpus Based Paradigm Matching.
- DONE - Transliteration.
- DONE - Workflow reference.
- DONE - Tagger Training.
- DONE - Modes introduction.
- DONE - Apertium-viewer.