Difference between revisions of "User:Shraier/reports"

Latest revision as of 14:37, 26 August 2011

Community Bonding Period

During this period I familiarized myself with the Apertium system. I set up the new language pair and added it to the SVN repository. I also got in touch with other community members and, as far as I can see, they are all very kind.

I also made an early start on the work of the coding period; here is a short list of the changes and fixes I made:

Week 1

  • Apertium Wiki: User account and user page created
  • Slovenian monolingual dictionary clean up - Deleted paradigm: <pardef n="/prpers__n"> //<-- WTF IS THIS
  • Slovenian monolingual dictionary clean up - Remake of pronouns (n="/prpers__n") - Personal and emphatic
  • Slovenian monolingual dictionary clean up - More checks, edited n="k/arkoli__prn" to n="/karkoli__prn" (it contains "česarkoli", "čemurkoli" and "čimerkoli")
  • Slovenian monolingual dictionary clean up - More checks, cleaned duplicates in n="vajin/__prn"

Week 2

  • Wrote a script that sorts paradigm entries (it still has some minor bugs)
  • Slovenian monolingual dictionary clean up - Took p2 from "vajin/__prn" and made a new paradigm "njen/__pr" - set to lm="njen" and "njun"
  • Slovenian monolingual dictionary clean up - Paradigm "njihov/__prn" cleanup + some checks
  • Slovenian monolingual dictionary clean up - Paradigm "tolikš/en__prn", lemmas drugašen, kolikšen, nekakšen, tolikšen
  • Slovenian monolingual dictionary clean up - Paradigm "tolik/__prn", lemmas: enak, kak, kolik, nekak, tolik
  • Slovenian monolingual dictionary clean up - Added paradigm "tolikš/en__prn" to lemma="kakršen";
  • Slovenian monolingual dictionary clean up - Deleted lm="enako", par fixes= "nekat/i__prn", "kater/i__prn", "k/do__prn", "k/dor__prn"; deleted par="koli/__prn" and lm="koli" (does not exist), added par="koliko__prn"
  • Slovenian monolingual dictionary clean up - Deleted lm="mnogo", "onega", "precej", "tainta"; Deleted par n="nekoliko/__prn", "nikogar/__prn", "precej/__prn", "t/ainta__prn"
  • Slovenian monolingual dictionary clean up - Added "oba" as determinative/pronoun
  • Slovenian monolingual dictionary clean up and fixes
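
The sorting script mentioned in the first item above is not shown in this report. As a rough idea of what such a step can look like, here is a minimal sketch that orders the entries of every pardef by their tag sequence. The sample paradigm, the function names and the sort key are assumptions made for illustration, not the script that was actually used.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the paradigm and its entries are made up for illustration.
SAMPLE = """<dictionary>
  <pardefs>
    <pardef n="hiš/a__n">
      <e><p><l>o</l><r>a<s n="n"/><s n="f"/><s n="sg"/><s n="acc"/></r></p></e>
      <e><p><l>a</l><r>a<s n="n"/><s n="f"/><s n="sg"/><s n="nom"/></r></p></e>
      <e><p><l>e</l><r>a<s n="n"/><s n="f"/><s n="sg"/><s n="gen"/></r></p></e>
    </pardef>
  </pardefs>
</dictionary>"""


def entry_key(e):
    """Sort key: the tag sequence of the entry first, then its surface ending."""
    left = e.find("p/l")
    right = e.find("p/r")
    surface = (left.text or "") if left is not None else ""
    tags = tuple(s.get("n") for s in right.iter("s")) if right is not None else ()
    return (tags, surface)


def sort_pardefs(root):
    """Reorder the children of every <pardef> in place and return the root."""
    for pardef in root.iter("pardef"):
        pardef[:] = sorted(pardef, key=entry_key)
    return root


root = sort_pardefs(ET.fromstring(SAMPLE))
print(ET.tostring(root, encoding="unicode"))
```

Sorting on the tag sequence keeps entries with the same analysis next to each other, which makes duplicates easier to spot by eye. Note that ElementTree drops XML comments when parsing, so a real script would need to handle those separately.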

Week 3

  • Slovenian monolingual dictionary clean up - Deleted lm="najnajin", "tega", "toliko", "vsakogar"; deleted par="naj/najin__prn", "tistega/__prn", "toliko/__prn", "vsakogar/__prn"; More fixes
  • Slovenian monolingual dictionary clean up - More fixes
  • Slovenian monolingual dictionary clean up - Added paradigm "barvil/o__n" and fixed a group of nouns (~240)
  • Slovenian monolingual dictionary clean up - Added paradigm "akrobatik/a__n", more fixes

Week 4

  • Slovenian monolingual dictionary clean up - A lot of fixes and checks

Coding period

Here we are in the coding period.

Important notes:
- My mentor and I decided to adjust the weekly plan slightly. The first two weeks are now reserved for the "Correction of errors of the Slovenian monolingual morphology (manual)" and the third week for the "Correction of the differences in source and target tag sets of the morphological dictionaries".

Week 1

Work to be done: Correction of errors of the Slovenian monolingual morphology (manual)
Result:

  • Slovenian monolingual dictionary clean up - Nouns + Proper nouns (~7280 lemmas and ~12500 lines of paradigms)
    • All the paradigm entries had to be checked, all duplicates had to be removed and all missing entries had to be added. I also had to remake and split paradigm entries for "Proper Names" with different tags (now we have all proper names grouped in 3 groups - .ant, .cog, .top)
    • Around 1200 lemmas have been removed because they were not properly tagged

Week 2

Work to be done: Correction of errors of the Slovenian monolingual morphology (manual)
Result:

  • Slovenian monolingual dictionary clean up - Interjections and Abbreviations checked, Adverbs completely remade + Part of Adjectives
    • Adverbs: All lemmas (~3000) had to be checked. I grouped the lemmas that share the same paradigm entries and made new paradigms for each group (previously there was only the default paradigm).
    • Duplicates and non-adverbs (~1200 lemmas) have been deleted
    • Adjectives: Lemmas are now linked to paradigms (one paradigm per type: pst, comp, sup, ela), which are in turn linked to two paradigms containing all the entries needed: lm4 -> par4 (adj.pst/comp/sup/ela) -> par2 (other tags)
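
The grouping step described for adverbs can be sketched as follows: lemmas whose sets of analyses are identical end up in the same group, and each group then needs only one paradigm. The input data, the tag strings and the function name are hypothetical; the real grouping was done on the dictionary itself.

```python
from collections import defaultdict

# Hypothetical input: lemma -> set of analyses (tag strings) found for it.
ENTRIES = {
    "hitro": {"adv.pst", "adv.comp", "adv.sup"},
    "počasi": {"adv.pst", "adv.comp", "adv.sup"},
    "danes": {"adv"},
    "včeraj": {"adv"},
}


def group_lemmas(entries):
    """Group lemmas by their exact set of analyses; one paradigm per group."""
    groups = defaultdict(list)
    for lemma in sorted(entries):
        groups[frozenset(entries[lemma])].append(lemma)
    return dict(groups)


for signature, lemmas in group_lemmas(ENTRIES).items():
    print(sorted(signature), "->", lemmas)
```

Using a frozenset as the dictionary key means the order in which analyses are listed does not matter, only which analyses are present.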

Week 3

Work to be done: Correction of errors of the Slovenian monolingual morphology (manual); Correction of the differences in source and target tag sets of the morphological dictionaries (possibly write rules for this part).
Result:

  • Slovenian monolingual dictionary clean up - All Adjectives completely remade
    • Adjectives: All lemmas had to be checked - ~7000 lemmas; All duplicates have been removed and all paradigms remade (~9500 lines of paradigm entries)
    • Source and target tag sets - Rules will be written after the bilingual dictionary is made

Week 4

Work to be done: Correction of errors of the Slovenian monolingual morphology (manual); Preparation of the automatic evaluation framework based on METEOR.
Result:

  • Slovenian monolingual dictionary clean up - All verbs completely remade
    • Verbs: All lemmas had to be checked - ~3500 lemmas; All duplicates have been removed and all paradigms remade (~8600 lines of paradigm entries)
    • Verb paradigms: all verbs are tagged as 'perfective' or 'imperfective' (or both), Left participles have been tagged as 'lp' and passives as 'pp'
  • Preparation of the automatic evaluation framework based on METEOR
    • The automatic evaluation framework based on METEOR has been prepared


Deliverable #1: Cleaned Slovenian morphology.
Deliverable #2: Evaluation system.

Week 5 and 6

Work to be done: Compile the bilingual translation dictionary using Google Translate; Compile the bilingual translation dictionary using the method presented in [3]
Result:

  • Slovenian monolingual dictionary clean up - Added MA and MI tags to Nouns and Proper Nouns
  • The sdefs (tag definitions) in the SL monolingual dictionary have been translated into Slovenian (JimRegan's idea)
  • MOSES system has been established for sl-es
  • Compiling of sl-es bilingual dictionary started:
    • Wrote about 20 scripts to compute different intersections between the SL monodix, the SL-IT bidix and the IT monodix. Verbs have been translated with Google Translate.
      • This gives about 4500 SL-IT lemmas, which have to be translated to ES with the it-es bidix (manual revision of IT lemmas needed)
      • It also gives about 850 verbs (lemmas), which likewise have to be translated from IT to ES with the it-es bidix (manual revision of IT lemmas needed)
    • Other lemmas translated to ES with Google Translate (need revision)
      • 3800 lemmas translated to ES with Google Translate that are present in the ES monodix; they need manual revision (generated in bidix form)
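
The scripts themselves are not part of this report. The underlying idea of going from SL to ES through IT can be sketched roughly as below: compose the SL-IT pairs with the IT-ES pairs and keep only ES lemmas that the ES monodix knows. All data and names here are hypothetical, and real entries would also carry part-of-speech tags, which this sketch ignores.

```python
def pivot(sl_it, it_es, es_monodix):
    """Compose SL->IT and IT->ES pairs, keeping ES lemmas found in the ES monodix."""
    candidates = {}
    for sl, it_lemmas in sl_it.items():
        es = set()
        for it in it_lemmas:
            es |= it_es.get(it, set())
        es &= es_monodix  # drop translations the ES monodix cannot generate
        if es:
            candidates[sl] = sorted(es)
    return candidates


# Hypothetical toy data for illustration only.
SL_IT = {"hiša": {"casa"}, "pes": {"cane"}, "miza": {"tavolo", "tavola"}}
IT_ES = {"casa": {"casa"}, "cane": {"perro"}, "tavolo": {"mesa"}, "tavola": {"mesa", "tabla"}}
ES_MONODIX = {"casa", "perro", "mesa"}

for sl, es in pivot(SL_IT, IT_ES, ES_MONODIX).items():
    print(sl, "->", es)
```

The candidates produced this way are only suggestions. Pivoting through a third language can pick the wrong sense of an ambiguous word, which is why manual revision is still needed.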


Deliverable #3: MOSES System for SL-ES pair.

Week 7

Work to be done: Manual revision of the bilingual translation dictionary entries
Result:

  • Slovenian monolingual dictionary clean up - Added numeral (num) lemmas, both ordinal and cardinal (from 0 to 100)
  • SL-ES bilingual dictionary:
    • 5900 lemmas added to the bilingual dictionary
  • Additional work for Midterm article translation

Week 8

Work to be done: Manual revision of the bilingual translation dictionary entries
Result:

  • SL-ES bilingual dictionary:
    • Around 3000 lemmas added to the bilingual dictionary
  • Additional work on the sl-es tagger

Week 9 and 10

Work to be done: Manual revision of the bilingual translation dictionary entries
Result:

  • Slovenian monolingual dictionary clean up:
    • Added gerund forms of verbs to cover verbal nouns ("glagolniki")
  • Spanish monolingual dictionary clean up:
    • Added around 250 missing lemmas (taken from es-ca)
    • Added lemmas "de _something_"
  • SL-ES bilingual dictionary:
    • Around 600 lemmas added to the bilingual dictionary
    • Secondary perfective/imperfective variants of lemmas added with r="LR"
    • Covered the first stage of missing words provided by Francis
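
In an Apertium bidix, r="LR" marks an entry that is used only in the left-to-right direction. A minimal sketch of how such entries might be generated for aspect pairs is shown below. It assumes that the first SL verb seen for a given ES verb is the primary one and every later one gets r="LR"; that rule, the "vblex" tag and the function names are assumptions for illustration, not a description of how the entries were actually made.

```python
import xml.etree.ElementTree as ET


def bidix_entry(sl, es, tag, restriction=None):
    """Build one <e> element; restriction is None or a direction such as "LR"."""
    e = ET.Element("e")
    if restriction:
        e.set("r", restriction)
    p = ET.SubElement(e, "p")
    for side_name, lemma in (("l", sl), ("r", es)):
        side = ET.SubElement(p, side_name)
        side.text = lemma
        ET.SubElement(side, "s", n=tag)
    return e


def build_entries(pairs, tag="vblex"):
    """Serialize (sl, es) pairs; repeated ES lemmas are restricted to SL->ES."""
    seen_es = set()
    out = []
    for sl, es in pairs:
        restriction = "LR" if es in seen_es else None  # assumed rule, see lead-in
        seen_es.add(es)
        out.append(ET.tostring(bidix_entry(sl, es, tag, restriction), encoding="unicode"))
    return out


# "kupiti" (perfective) and "kupovati" (imperfective) both map to "comprar".
for line in build_entries([("kupiti", "comprar"), ("kupovati", "comprar")]):
    print(line)
```

With the restriction in place, both Slovenian verbs translate to the same Spanish verb, while translation in the other direction stays unambiguous because only the unrestricted entry applies.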


Deliverable #3: Bilingual dictionary for sl-es.

Week 11 and 12

Work to be done: Manual revision of the bilingual translation dictionary entries; Compile transfer rules according to the contrastive grammar
Result:

  • Slovenian monolingual dictionary clean up:
    • Cleared the unused lemmas
  • Spanish monolingual dictionary clean up:
  • SL-ES bilingual dictionary:
    • Added around 650 lemmas from Jim's crossdix
    • Translated around 1300 lemmas from Francis' list of missing words - around 750 were added to the bidix.
  • Additional work has been done on the sl-es tagger
  • Transfer rules and macros
    • Basic transfer rules and macros added for verbs (t1x).