User:Pmodi/GSOC 2020 proposal: Hindi-Punjabi/progress

Repos

Current tasks

(Project Ended)

Progress

Progress table

Week №	Dates	Stems (hin)	Stems (pan)	Stems (pan-hin)	Coverage (hin)	Coverage (pan)	Coverage (pan-hin)	WER, PER (hin→pan)	WER, PER (pan→hin)	Evaluation	Notes
1	May 1-24	-	-	-	-	47%	-	-	-	-	Original coverage: 12.8%
2	May 24-31	-	+1400	-	-	52%	-	36.83%, 33.84%	40.03%, 36.93%	-	WER, PER on a set of 25 sentences (612 words)
3	June 1-7	-	+500	+250	-	58%	-	49.52%, 42.71%	48.65%, 44.03%	-	WER, PER on a set of 50 sentences (1038 words)
4	June 8-14	+50	+1500	+1000	-	63%	-	41.36%, 35.82%	41.47%, 37.03%	-	WER, PER on a set of 50 sentences (1038 words)
5	June 15-21	-	+50	+200	-	66%	52.8%	41.36%, 35.82%	41.47%, 37.03%	-	WER, PER on a set of 50 sentences (1038 words)
6	June 22-28	-	+350	+700	-	70%	59%	39.76%, 34.22%	40.03%, 36.23%	-	WER, PER on a set of 50 sentences (1038 words)
7	June 29-Jul 6	-	+350	+700	-	71.3%	61%	39.76%, 34.22%	40.03%, 36.23%	-	WER, PER on a set of 50 sentences (1038 words)
8	July 7-13	-	+1200	+1600	-	73%	63%	39.76%, 34.22%	40.03%, 36.23%	-	WER, PER on a set of 50 sentences (1038 words)
9	July 14-20	-	+2000	+2000	-	74%	65%	36.76%, 32.22%	39.03%, 33.53%	-	WER, PER on a set of 100 sentences (2140 words)
10	July 21-27	-	+2500	+2200	-	75%	67%	33.76%, 29.62%	35.03%, 31.23%	-	WER, PER on a set of 100 sentences (2140 words)
11	July 28-Aug 2	-	+1200	+1300	-	78%	70%	32.34%, 28.82%	34.45%, 30.43%	-	WER, PER on a set of 100 sentences (2140 words)
12	Aug 3-9	-	+1800	+1100	-	81%	70%	28.04%, 25.85%	30.15%, 29.73%	-	WER, PER on a set of 110 sentences (2140 words)
13	Aug 10-16	+200	+300	+/-/*5000	-	83%	80%	22.04%, 19.75%	26.15%, 23.13%	-	WER, PER on a set of 110 sentences (2140 words)
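
For readers unfamiliar with the two metrics in the table, here is a minimal sketch of how word error rate (WER) and position-independent error rate (PER) are commonly defined. It is an illustration only and not the evaluation tool used for the figures above; the function names and the whitespace tokenisation are assumptions.

```python
from collections import Counter


def word_edit_distance(ref, hyp):
    """Levenshtein distance between two lists of tokens."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return word_edit_distance(ref, hyp) / len(ref)


def per(reference, hypothesis):
    """Position-independent error rate: like WER, but ignoring word order."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Multiset intersection counts words shared by both sides.
    matches = sum((Counter(ref) & Counter(hyp)).values())
    return (max(len(ref), len(hyp)) - matches) / len(ref)
```

Because PER ignores word order, it is never higher than WER for the same sentence pair, which matches the pattern in every row of the table.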

DONE

DONE - Added WX transliterations to bidix.
DONE - pan-hin translator coverage up to 80%.
DONE - Pan monodix coverage up to 85%
DONE - WER below 20% for hin-pan
DONE - WER below 25% for pan-hin
DONE - Made the tagger unigram (@spectie).
DONE - Added .lsx files and modes for multiwords (mostly with prepositions).
DONE - Added bidix coverage script
DONE - Added Verbs.
DONE - Bidix updates.
DONE - Fixed errors with postposition transfer.
DONE - Added proper nouns.
DONE - Added noun paradigms.
DONE - Added Adverbs.
DONE - Added about 1400 adjective stems
DONE - Added function words (cnj, det, prn, post, gen_endings); coverage > 47%.
DONE - Collected parallel texts to calculate WER, PER, etc.
DONE - Added bidirectional dictionary (33k paradigms).
DONE - Fixed bidirectional translation, i.e. pan->hin (gave close to human translation for a small test set, even though similar transfer rules were copied).
DONE - Scraped all Wikipedia texts and made a combined frequency list.
DONE - Frequency lists using WikiExtractor on latest dump.
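
As an illustration of the frequency-list and coverage items in the list above, here is a minimal sketch. It is not the script added to the repository: the function names are hypothetical, tokenisation is plain whitespace splitting, and the coverage function assumes analyser output in the Apertium stream format, where an unknown word appears as `^word/*word$`. Escaped `$` or `/` characters inside a lexical unit are not handled.

```python
import re
from collections import Counter

# One lexical unit in the Apertium stream format: ^surface/analysis1/...$
LEXICAL_UNIT = re.compile(r"\^([^$]*)\$")


def frequency_list(text):
    """Return (token, count) pairs, most frequent first."""
    return Counter(text.split()).most_common()


def coverage(analysed):
    """Fraction of lexical units that received at least one analysis."""
    total = known = 0
    for unit in LEXICAL_UNIT.findall(analysed):
        _surface, _, analyses = unit.partition("/")
        total += 1
        # The analyser marks unknown words with a leading asterisk.
        if not analyses.startswith("*"):
            known += 1
    return known / total if total else 0.0
```

Sorting unknown words by frequency is a natural way to decide which stems to add next, since the most frequent unknowns give the largest coverage gain.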

IN PROGRESS

(Project ended: see #Future_Work)

TODO

(Project ended: see #Future_Work)

Future Work

Conjunct verbs in the Hindi monodix exist as multiwords, which is (1) not necessary and (2) not a great way of implementing them. Most of these are still handled when translating to Punjabi, because the first verb of the conjunct takes its form from an already existing verb without the second verb (that is, if it can exist alone). However, this can cause problems at the chunking stage, where the conjunct verb is read as two verbs (in the case of Punjabi).

Solution: use separable (apertium-separable).

Multiple paradigms, mostly in the <prn> category, use "mf" in place of m-obj. This needs to be fixed in several places in the Hindi and Urdu dictionaries. For example, the number of the subject and object matters for pronouns: ਮੇਰੀਆਂ | ਸਾਡੀ; मेरे (marked mf) | मेरा (marked m).
The tagger needs to be trained to prefer verbs and postpositions/prepositions over adjectives and adverbs (this is not a regular trend, and results are mostly decent).
In some colloquial usage, postpositions are fused with words. These need to be identified and handled, for example by adding the postposition to the definition: ਹਸਪਤਾਲੋਂ = ਹਸਪਤਾਲ ਤੋਂ.
What is the analysis for ਦੋਸਤੋਂ ("Friends, come with me")? The problem is that doston in Hindi is the translation both for this and for "friends" as used in "his friends did this". Not a big problem, but it needs analysis to identify more such occurrences.
ਮੈਂ is the translation for both मैंने and मैं. Add a transfer rule and check whether the tagger correctly identifies which one is being picked.
Fix alternate-spelling analyses. Multiple spellings have mostly been added, but some easy automatic fixes could still be made. For example: ਉਸ੍ਸਰ > ਉੱਸਰ, ਖਿਲ੍ਲਰ > ਖਿੱਲਰ.
कि vs. की: Punjabi ki can also be kya.
Fix borrowed words (mostly English) and proper nouns. Transliteration can be worked on once these NEs can be correctly identified.
Lexical selection rules need to be added. This could not be completed due to some errors in IRSTLM. For example, a simple rule: aisa <noun> → ajiha, but aisa <verb> → innj.
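
The automatic spelling fix mentioned in the list above (ਉਸ੍ਸਰ > ਉੱਸਰ) could be sketched roughly as follows. This is an assumption about what such a fix might look like, not code from the project: it rewrites a Gurmukhi consonant followed by a virama and the same consonant as addak plus that consonant. The function name is hypothetical, and only the basic consonant range U+0A15 to U+0A39 is covered.

```python
import re

VIRAMA = "\u0A4D"  # Gurmukhi sign virama
ADDAK = "\u0A71"   # Gurmukhi addak, marks gemination of the next consonant

# A basic Gurmukhi consonant, a virama, then the same consonant again.
GEMINATE = re.compile("([\u0A15-\u0A39])" + VIRAMA + r"\1")


def normalise_geminates(word):
    """Rewrite consonant + virama + same consonant as addak + consonant."""
    return GEMINATE.sub(ADDAK + r"\1", word)
```

Clusters of two different consonants are left untouched, so the rule only affects the doubled spellings given as examples.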

Literature (Apertium Wiki) Covered

DONE - Calculating Coverage.
DONE - A long introduction on Transfer Rules.
DONE - Transfer Rules examples
DONE - Wikipedia Dumps.
DONE - Generating Frequency lists.
DONE - Building Dictionaries#Monolingual.
DONE - Evaluation.
DONE - Extract.
DONE - Monodix Basics
DONE - Improved Corpus Based Paradigm Matching.
DONE - Transliteration.
DONE - Workflow reference.
DONE - Tagger Training.
DONE - Modes introduction.
DONE - Apertium-viewer.
