User:Pmodi/GSOC 2020 proposal: Hindi-Punjabi/progress
==Summary==

The goal of this project was to develop an analyzer for Punjabi along with the translators apertium-hin-pan and apertium-pan-hin. Coverage for Punjabi was calculated on Wikipedia data (all articles from the Punjabi Wikipedia, about 3.6 million tokens). While only very simple transfer rules are required for this pair, getting the paradigms and the correct analyses of postpositions and multiwords right was essential. A unigram tagger was trained for this task, and apertium-separable was used for multiwords. More work needs to be done on lexical selection rules and a few other tasks (see #Future_Work). The results on Wikipedia articles were fairly good, except for articles with many borrowed words or proper nouns; this will be a problem for the existing hin-urd pair as well. Adding every spelling variation of proper nouns (transliterated into Hindi) is a difficult task, so some other alternative needs to be explored. Overall, good coverage was achieved (given that the Punjabi analyzer was picked up from a very nascent state), and WER should improve once corpus-based lexical selection is done.
==Repos==

* https://github.com/apertium/apertium-pan
* https://github.com/apertium/apertium-hin-pan
* https://github.com/apertium/apertium-hin
Most code was finally merged and can be found here: https://apertium.projectjj.com/gsoc2020/priyankmodiPM.html. Thanks to Tino :)
==Main Work==

===Dictionaries===
The major accomplishment of this project has been the Punjabi monodix, which reached 85% coverage (with over 12,000 stems) on a total of 3.6 million tokens (the entire Punjabi Wikipedia corpus). In addition to new entries (added from frequency lists and crossdics), entries broken by tag changes have been fixed, and the dictionary has been reorganized and cleaned. Additions have also been made to the Hindi monodix, including paradigm changes in cases handled by separable.
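Coverage here is naive coverage: the share of tokens that get at least one analysis from the analyzer. A minimal sketch of how this can be measured with lttoolbox (file names are hypothetical):

<pre>
# Naive coverage: lt-proc marks unknown tokens with "/*".
apertium-destxt < pan.wikipedia.txt | lt-proc pan.automorf.bin > analysed.txt
total=$(grep -o '\^[^$]*\$' analysed.txt | wc -l)
unknown=$(grep -o '\^[^$]*\$' analysed.txt | grep -c '/\*')
echo "coverage: $(echo "scale=4; ($total - $unknown) / $total" | bc)"
</pre>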
The bidix also has close to 17,500 translations, and the pan-hin translator's coverage is just above 80%, computed on the same corpus. Weights have been assigned to entries where a translation has multiple spelling options; automatically picking the correct spelling will be taken up in future work. The hin-pan translator also has decent coverage: ~100% for postpositions/prepositions, pronouns, proper nouns, adverbs and intransitive verbs, >80% for adjectives, and ~90% for transitive verbs. Most nouns high up or midway on the Zipf curve have been added, but quite a few lower-frequency ones remain to be added.
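A minimal sketch of what such weighted entries can look like in dix format (the lemmas and weights here are made up, and this assumes lttoolbox's weighted entries, where lower weights are preferred):

<pre>
<!-- Hypothetical example: two Hindi spellings for the same Punjabi word;
     the lower-weight entry should win when generating Hindi. -->
<e w="0.0"><p><l>ਸਕੂਲ<s n="n"/></l><r>स्कूल<s n="n"/></r></p></e>
<e w="1.0"><p><l>ਸਕੂਲ<s n="n"/></l><r>इस्कूल<s n="n"/></r></p></e>
</pre>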
The pair still needs to go through testvoc iterations, since some verbs in the bidix predate the project. Most of these have been removed, and proper nouns, adverbs, adjectives and almost all nouns have been checked.
===Separable===
Thanks to the lsx module, Apertium can efficiently handle multiwords even in cases where one of the words is a suffix of some other word. Once IRSTLM is fixed, more examples that would benefit from separable can be analyzed. @tanmai_khanna and @spectie were a huge help in guiding me through this part of the project :)
===Tagger===
The Hindi and Punjabi dictionaries were originally using taggers (.prob files) directly copied from the English module. @spectie helped me train a newer version and make the taggers unigram.
===Transfer Rules===
Hindi and Punjabi are syntactically very close, so rules from the existing Urdu-Hindi module were mostly sufficient for this pair as well.
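Because the word order matches so closely, most rules simply pass the matched words through. A minimal sketch of such a rule in the .t1x format (<code>nom</code> and <code>post</code> are assumed to be defined in <code>def-cats</code>; whether a given rule belongs in the chunking or single-stage pipeline is not shown here):

<pre>
<rule comment="noun + postposition: pass both through unchanged">
  <pattern>
    <pattern-item n="nom"/>
    <pattern-item n="post"/>
  </pattern>
  <action>
    <out>
      <lu><clip pos="1" side="tl" part="whole"/></lu>
      <b pos="1"/>
      <lu><clip pos="2" side="tl" part="whole"/></lu>
    </out>
  </action>
</rule>
</pre>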
===Quality===
The translator performs well (tested on Wikipedia articles and stories taken from Indic websites) as long as there aren't a lot of borrowed words or proper nouns with spelling variations. This remains a challenge, especially for technical Wikipedia articles. On a generic Wikipedia article about the state of Pakistan, the translator achieved a WER of 10.84%.
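WER and PER figures like these can be computed with apertium-eval-translator, which compares the raw translator output against a (post-edited) reference translation. A typical invocation, assuming the script's -test/-ref options (file names hypothetical):

<pre>
# -test: machine translation output; -ref: reference translation.
apertium-eval-translator -test pakistan.pan.mt.txt -ref pakistan.pan.ref.txt
</pre>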
==Current tasks==

<i>(Project ended)</i>
==Progress==

===Progress table===
{|class=wikitable
|-
!colspan="2"|Week
!colspan="3"|Stems added
!colspan="3"|Coverage
!colspan="2"|WER, PER
!colspan="2"|Progress
|-
! № !! Dates !! hin !! pan !! pan-hin !! hin !! pan !! pan-hin !! hin→pan !! pan→hin !! Evaluation !! Notes
|-
| 1 || May 1-24 || - || - || - || - || 47% || - || - || - || - || Original coverage: 12.8%
|-
| 2 || May 24-31 || - || +1400 || - || - || 52% || - || 36.83%, 33.84% || 40.03%, 36.93% || - || WER, PER on a set of 25 sentences (612 words)
|-
| 3 || June 1-7 || - || +500 || +250 || - || 58% || - || 49.52%, 42.71% || 48.65%, 44.03% || - || WER, PER on a set of 50 sentences (1038 words)
|-
| 4 || June 8-14 || +50 || +1500 || +1000 || - || 63% || - || 41.36%, 35.82% || 41.47%, 37.03% || - || WER, PER on a set of 50 sentences (1038 words)
|-
| 5 || June 15-21 || - || +50 || +200 || - || 66% || 52.8% || 41.36%, 35.82% || 41.47%, 37.03% || - || WER, PER on a set of 50 sentences (1038 words)
|-
| 6 || June 22-28 || - || +350 || +700 || - || 70% || 59% || 39.76%, 34.22% || 40.03%, 36.23% || - || WER, PER on a set of 50 sentences (1038 words)
|-
| 7 || June 29-Jul 6 || - || +350 || +700 || - || 71.3% || 61% || 39.76%, 34.22% || 40.03%, 36.23% || - || WER, PER on a set of 50 sentences (1038 words)
|-
| 8 || July 7-13 || - || +1200 || +1600 || - || 73% || 63% || 39.76%, 34.22% || 40.03%, 36.23% || - || WER, PER on a set of 50 sentences (1038 words)
|-
| 9 || July 14-20 || - || +2000 || +2000 || - || 74% || 65% || 36.76%, 32.22% || 39.03%, 33.53% || - || WER, PER on a set of 100 sentences (2140 words)
|-
| 10 || July 21-27 || - || +2500 || +2200 || - || 75% || 67% || 33.76%, 29.62% || 35.03%, 31.23% || - || WER, PER on a set of 100 sentences (2140 words)
|-
| 11 || July 28-Aug 2 || - || +1200 || +1300 || - || 78% || 70% || 32.34%, 28.82% || 34.45%, 30.43% || - || WER, PER on a set of 100 sentences (2140 words)
|-
| 12 || Aug 3-9 || - || +1800 || +1100 || - || 81% || 70% || 28.04%, 25.85% || 30.15%, 29.73% || - || WER, PER on a set of 110 sentences (2140 words)
|-
| 13 || Aug 10-16 || +200 || +300 || +/-/*5000 || - || 83% || 80% || 22.04%, 19.75% || 26.15%, 23.13% || - || WER, PER on a set of 110 sentences (2140 words)
|-
| 14 || Aug 17-24 || +100 || +600 || +300 || - || 85% || 81% || 8.4%, 6.78% || 26.44%, 22.31% || - || WER, PER on a set of 180 sentences (3200 words)
|}
===DONE===

* DONE - Added WX transliterations to the bidix.
* DONE - pan-hin translator coverage up to 80%.
* DONE - Punjabi monodix coverage up to 85%.
* DONE - WER below 20% for hin-pan.
* DONE - WER below 25% for pan-hin.
* DONE - Made the taggers unigram (@spectie).
* DONE - Added .lsx files and modes for multiwords (mostly with postpositions).
* DONE - Added a bidix coverage script.
* DONE - Added verbs.
* DONE - Bidix updates.
* DONE - Fixed errors with postposition transfer.
* DONE - Added proper nouns.
* DONE - Added noun paradigms.
* DONE - Added adverbs.
* DONE - Added about 1400 adjective stems.
* DONE - Function words (cnj, det, prn, post, gen_endings); coverage > 47%.
* DONE - Collected parallel texts to calculate WER, PER, etc.
* DONE - Added the bidirectional dictionary (33k paradigms).
* DONE - Fixed bidirectional translation, i.e. pan→hin (gave close-to-human translation on a small test set, even though similar transfer rules were copied).
* DONE - Scraped all Wikipedia texts and made a combined frequency list.
* DONE - Frequency lists using WikiExtractor on the latest dump (see the sketch below).
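A rough sketch of that frequency-list step (dump and path names are hypothetical):

<pre>
# Extract plain text from a Punjabi Wikipedia dump, then count tokens.
python WikiExtractor.py pawiki-latest-pages-articles.xml.bz2 -o extracted
cat extracted/*/wiki_* | grep -v '^<' | tr ' ' '\n' | grep -v '^$' \
  | sort | uniq -c | sort -nr > pan.freq
</pre>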
==IN PROGRESS==

<i>(Project ended: see #Future_Work)</i>
==TODO==

<i>(Project ended: see #Future_Work)</i>
==Future Work==

* Conjunct verbs in the Hindi monodix exist as multiwords, which is neither necessary nor a great way of implementing them. Most of them are handled even when translating to Punjabi, because the first verb in the conjunct takes its form from an already existing verb that can stand alone without the second one. However, this can cause problems in the chunking stage, where the conjunct verb is read as two verbs (in the case of Punjabi). Solution: use separable.
* Multiple paradigms, mostly in the <prn> category, use "mf" in place of m-obj. This needs to be fixed in multiple places in the Hindi and Urdu dictionaries. For example, the number of the subject/object matters for pronouns: ਮੇਰੀਆਂ | ਸਾਡੀ; मेरे (marked mf) | मेरा (marked m).
* The tagger needs to be trained to prefer verbs and postpositions/prepositions over adjectives and adverbs (this is not a regular trend, and results are mostly decent).
* In some colloquial registers, postpositions are fused onto words; these need to be identified. For example, add postpositions to definitions: ਹਸਪਤਾਲੋਂ = ਹਸਪਤਾਲ ਤੋਂ. It is a very irregular trend, though, and usually not used in written Punjabi.
* What is the analysis for ਦੋਸਤੋਂ ("Friends, come with me")? The problem is that Hindi doston translates both this vocative use and "friends" as in "his friends did this". Not a big problem, but it needs analysis to identify more such occurrences.
* Add a transfer rule and check whether the tagger correctly identifies which reading is being picked: ਮੈਂ is the translation for both मैंने and मैं.
* Fix alternate-spelling analyses. Multiple spellings have mostly been added, but some easy automatic fixes should be possible, for example ਉਸ੍ਸਰ > ਉੱਸਰ, ਖਿਲ੍ਲਰ > ਖਿੱਲਰ (see the normalization sketch after this list).
* Fixes for borrowed words (mostly English) and proper nouns. Transliteration can be worked on once these named entities are correctly identified.
* Lexical selection rules need to be added; this could not be completed due to errors in IRSTLM. For example, a simple rule: aisa before a noun → ajiha, but aisa before a verb → innj (see the rule sketch after this list).
* Meeting testvoc requirements before release.
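As referenced above, the alternate-spelling fix could be automated. A sketch of the substitution, assuming gemination is always written as a consonant, the Gurmukhi virama (U+0A4D), and the same consonant again, and assuming GNU sed in a UTF-8 locale:

<pre>
# Rewrite C + virama + C as addak (U+0A71) + C:
echo 'ਉਸ੍ਸਰ ਖਿਲ੍ਲਰ' | sed 's/\(.\)੍\1/ੱ\1/g'
# expected output: ਉੱਸਰ ਖਿੱਲਰ
</pre>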
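And a sketch of the lexical selection rules mentioned above, in the apertium-lex-tools .lrx format (lemmas shown romanized here as in the note above; real rules would use the Devanagari/Gurmukhi forms, and the n/vblex tag names are assumptions):

<pre>
<rules>
  <!-- aisa before a noun: choose ajiha -->
  <rule>
    <match lemma="aisa"><select lemma="ajiha"/></match>
    <match tags="n.*"/>
  </rule>
  <!-- aisa before a verb: choose innj -->
  <rule>
    <match lemma="aisa"><select lemma="innj"/></match>
    <match tags="vblex.*"/>
  </rule>
</rules>
</pre>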
==Experience and Final words==
Working with Apertium over the past three months has been nothing less than amazing. I don't think I ever had a problem understanding the pipeline (the limited parts of it I used), because of the extensive documentation that exists. The community is probably the most helpful one I've ever been part of. I would love to see this pair released soon, after, of course, a good review of my work.
==Literature (Apertium Wiki) Covered==
* DONE - Calculating Coverage.
* DONE - A long introduction on Transfer Rules.
* DONE - Transfer Rules examples.
* DONE - Wikipedia Dumps.
* DONE - Generating Frequency lists.
* DONE - Building Dictionaries#Monolingual.
* DONE - Evaluation.
* DONE - Extract.
* DONE - Monodix Basics.
* DONE - Improved Corpus Based Paradigm Matching.
* DONE - Transliteration.
* DONE - Workflow reference.
* DONE - Tagger Training.
* DONE - Modes introduction.
* DONE - Apertium-viewer.