Difference between revisions of "User:Pyry/Sandbox"

From Apertium
Latest revision as of 09:59, 29 October 2010

Challenges in Finnish to North Sámi rule-based machine translation
Translating the Bible from Finnish to North Sámi
Trials and tribulations in Finnish to North Sámi rule-based machine translation

http://www.uoc.edu/freerbmt11/

Submission deadline: 20th November (originally Nov 8)



(13:38:36) francis: 1) underspecification in omorfi (e.g. cc/cs vs. conj)
(13:39:08) francis: 2) differing grammatical traditions (acc/gen??) merge in omorfi but not in GT
(13:39:28) francis: 3) overgeneration in sme
(13:40:27) francis: 4) sometimes it wasn't clear when words were assigned to defective paradigms (e.g. some pronouns?? didn't decline in some cases)
(13:41:20) francis: btw, we actually had a three way tagset disjunct
(13:41:29) francis: between omorfi, fred's CG and giellatekno
(13:44:30) ryan: one of the larger problems i thought was figuring out what exactly trying to do with compound words
(13:45:18) ryan: sankari probably 
(13:45:20) francis: i have one
(13:45:21) ryan: means hero 
(13:45:26) ryan: but it ended up with a compound analysis 
(13:45:27) ryan: san# kari 
(13:45:29) francis: saamelainen
(13:45:32) ryan: ooh, that too
(13:45:43) ryan: even more related ;) 
(13:46:18) francis: 6) differing lexicalisation

(13:46:25) francis: kritiserema	kritiseret+V+TV+Der3+Der/n+N+Sg+Acc  

vs. kritisoinnin	kritisointi+N+Sg+Gen

Paper

Introduction

  • MT ((S/CB)MT and RBMT)
    • Why SMT is difficult to do for agglutinative languages
    • Also why RBMT is exciting and fun -- you can learn stuff about languages!, and SMT is boring -- you don't learn much :(
  • MT from major to minor language
  • MT between related languages
  • MT between agglutinative languages: Turkish--{Tatar,Turkmen,...}
  • MT between Finno-Ugric / Sámi languages

Languages

Contrastive analysis of Finnish and North Sámi
  1. Cases
  2. Tenses
  3. Behaviour of NPs (partial agreement in Sámi)

Implementation

Tools
  1. HFST
  2. Constraint Grammar
  3. Apertium
Problematic aspects

Distinguish pair-specific from general problems, and look for the general in the special:

  • Tag-adjustment as a central part of FLOSS grammar projects
    • solution: documentation, supersets??, adjustment?? Can this case teach us something?
    • The same for the points below
  • Then general, linguistic topics


  1. Issues related to the merging of transducers
    1. Tagset differences (three-way: Omorfi, Fred's CG and Giellatekno)
    2. underspecification in omorfi (e.g. cc/cs vs. conj)
    3. overgeneration in sme -- transducers targeted at _analysis_, not generation.
    4. sometimes it wasn't clear when words were assigned to defective paradigms (e.g. some pronouns?? didn't decline in some cases)
  2. lexc makes trimming lexicons challenging. (principle: we don't want to analyse what we can't translate)
  3. Issues related to grammatical traditions
    1. differing grammatical traditions (acc/gen??) merge in omorfi but not in GT
  4. Issues related to linguistic differences/characteristics
    1. differing lexicalisation ('saamelainen', 'kritiserema' vs. kritisoinnin)
  5. Agglutination makes typical strategies for testing MT systems (e.g. testvoc) very difficult. Instead we rely on a 'corpus test', but the corpus might not contain all the corner cases, so it is difficult to claim the system is "error free".
  6. Then what-is-this:
    1. compound words (e.g. 'sankari' = san#kari)
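
The tagset and underspecification items above could be illustrated with a minimal sketch like the one below. It is not the adjustment code used in the project: the tag tables, the function names and the coarse `CONJ` tag are all assumptions made for illustration, and the real Omorfi and Giellatekno inventories differ. It shows why a plain tag-to-tag table is not enough when one analyser underspecifies (conj vs. cc/cs): the finer tag has to be recovered from the lemma.

```python
from typing import Dict, List, Tuple

# Illustrative one-to-one tag renamings (assumed names, not a real inventory).
SIMPLE_MAP: Dict[str, str] = {
    "N": "n", "V": "v", "Sg": "sg", "Gen": "gen", "Acc": "acc",
}

# An underspecified tag cannot be renamed blindly: the subtype depends on
# the lemma. This tiny table is hypothetical and far from complete.
CONJ_SUBTYPE: Dict[str, str] = {"ja": "cc", "että": "cs"}


def split_analysis(analysis: str) -> Tuple[str, List[str]]:
    """Split 'lemma+Tag1+Tag2' into the lemma and its list of tags."""
    lemma, *tags = analysis.split("+")
    return lemma, tags


def harmonise(analysis: str) -> str:
    """Rewrite an analysis into angle-bracket tags in the target tagset."""
    lemma, tags = split_analysis(analysis)
    out: List[str] = []
    for tag in tags:
        if tag == "CONJ":
            # Fall back to the coarse tag when the lemma is not listed.
            out.append(CONJ_SUBTYPE.get(lemma, "conj"))
        else:
            # Unknown tags pass through unchanged so they stay visible.
            out.append(SIMPLE_MAP.get(tag, tag))
    return lemma + "".join(f"<{t}>" for t in out)


if __name__ == "__main__":
    print(harmonise("kritisointi+N+Sg+Gen"))
    print(harmonise("ja+CONJ"))
```

Passing unknown tags through unchanged is a deliberate choice in this sketch: unmapped tags then surface in the output and can be found by searching the corpus test results, rather than being silently dropped.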

Evaluation

Statistics
  1. Number of bilingual dictionary entries
  2. Number of disambiguation rules
  3. Number of transfer rules
Coverage
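
One common way to measure this is naive coverage: the share of tokens that receive at least one analysis. The sketch below assumes the analyser output is in the Apertium stream format, where each lexical unit is written `^surface/analysis$` and an unknown word has an analysis starting with `*`. The function name and the sample tags are illustrative, and escaped characters and superblanks are ignored for brevity.

```python
import re

# One lexical unit: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def naive_coverage(stream: str) -> float:
    """Return the fraction of lexical units with at least one analysis."""
    units = LEXICAL_UNIT.findall(stream)
    if not units:
        return 0.0
    # Unknown words are marked with a leading '*' in the analysis part.
    known = sum(1 for _, analyses in units if not analyses.startswith("*"))
    return known / len(units)


if __name__ == "__main__":
    sample = "^talo/talo<n><sg><nom>$ ^xyzzy/*xyzzy$ ^ja/ja<cc>$"
    print(f"{naive_coverage(sample):.2%}")
```

Naive coverage only says that some analysis exists, not that it is the right one or that the word can be translated, so it is an upper bound on what the full pipeline handles.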


Accuracy

Discussion

Advice
  1. Give advice for those building MT systems with Apertium
    1. from existing resources
    2. for agglutinative languages
Future work
  1. Expand the bilingual lexicon rapidly using Algu wordlist (xxx items)
  2. More transfer rules: give examples
  3. Lexical selection: give examples
  4. MT between other Finno-Ugric languages (fin-est, sme-sma, ...)
Conclusion
  1. Existing resources can be reused, but will almost certainly require adjustment for the purposes of MT.

References

  • Tantuğ, A. Cüneyd and Adalı, Eşref and Oflazer, Kemal (2007) A MT System from Turkmen to Turkish employing finite state and statistical methods. In: Machine Translation Summit XI, Copenhagen, Denmark
  • Kemal Altinas (2001) "TURKISH to CRIMEAN TATAR MACHINE TRANSLATION SYSTEM". Masters Thesis, Bilkent University.
  • Abulfat Fatullayev and Samir Shagavatov (2008) "TURKISH-AZERBAIJANI TRANSLATION MODULE OF DILMANC MT SYSTEM". The Second International Conference “Problems of Cybernetics and Informatics” September 10-12, 2008, Baku, Azerbaijan
  • Tyers, F. M. and Wiechetek, L. and Trosterud, T. (2009) "Developing prototypes for machine translation between two Sámi languages". Proceedings of the 13th Annual Conference of the European Association of Machine Translation, EAMT09. pp. 120--128
  • Wiechetek, L. and Tyers, F. M. and Omma, T. (2010) "Shooting at flies in the dark: Rule-based lexical selection for a minority language pair". Lecture Notes in Artificial Intelligence Volume 6233/2010, pp. 418--429
  • Tae Wan Kim, Jin Tae Lee, Chang Ho Park, Ki Sik Lee (1986) "MACHINE TRANSLATION OF THE URAL-ALTAIC AS AN AGGLUTINATIVE LANGUAGE". Proceedings of IAI-MT86, 20-22 August 1986
  • Muhtar MAHSUT, Yasuhiro OGAWA, Kazue SUGINO, Yasuyoshi INAGAKI () "Utilizing Agglutinative Features in Japanese-Uighur Machine Translation". MT Summit VIII.
  • István Varga and Soichi Yokoyama (2009) "Transfer rule generation for a Japanese-Hungarian machine translation system". MT Summit XII: Proceedings of the Twelfth Machine Translation Summit, August 26-30, 2009, Ottawa, Ontario, Canada. pp. 356--362