Difference between revisions of "User:Pyry/Sandbox"

From Apertium
Revision as of 16:56, 27 October 2010

Challenges in Finnish to North Sámi rule-based machine translation
Translating the Bible from Finnish to North Sámi
Trials and tribulations in Finnish to North Sámi rule-based machine translation

http://www.uoc.edu/freerbmt11/

Submission deadline: Nov 8



(13:38:36) francis: 1) underspecification in omorfi (e.g. cc/cs vs. conj)
(13:39:08) francis: 2) differing grammatical traditions (acc/gen??) merge in omorfi but not in GT
(13:39:28) francis: 3) overgeneration in sme
(13:40:27) francis: 4) sometimes it wasn't clear when words were assigned to defective paradigms (e.g. some pronouns?? didn't decline in some cases)
(13:41:20) francis: btw, we actually had a three way tagset disjunct
(13:41:29) francis: between omorfi, fred's CG and giellatekno
(13:44:30) ryan: one of the larger problems i thought was figuring out what exactly trying to do with compound words
(13:45:18) ryan: sankari probably 
(13:45:20) francis: i have one
(13:45:21) ryan: means hero 
(13:45:26) ryan: but it ended up with a compound analysis 
(13:45:27) ryan: san# kari 
(13:45:29) francis: saamelainen
(13:45:32) ryan: ooh, that too
(13:45:43) ryan: even more related ;) 
(13:46:18) francis: 6) differing lexicalisation

(13:46:25) francis: kritiserema	kritiseret+V+TV+Der3+Der/n+N+Sg+Acc  

vs. kritisoinnin	kritisointi+N+Sg+Gen
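The two analyses above can be compared mechanically. The following is a minimal sketch, assuming that an analysis string is a lemma followed by '+'-separated tags and that tags beginning with "Der" mark derivation. The helper names split_analysis and is_derived are hypothetical and are not part of HFST, Omorfi or the Giellatekno tools.

```python
def split_analysis(analysis):
    """Split 'lemma+Tag1+Tag2' into (lemma, [tags]).

    Assumes the lemma itself contains no '+' character.
    """
    lemma, *tags = analysis.split("+")
    return lemma, tags


def is_derived(tags):
    """Assumption: any tag starting with 'Der' marks a derivation."""
    return any(tag.startswith("Der") for tag in tags)


# The two analyses quoted in the chat log above.
sme = split_analysis("kritiseret+V+TV+Der3+Der/n+N+Sg+Acc")
fin = split_analysis("kritisointi+N+Sg+Gen")

for lemma, tags in (sme, fin):
    kind = "derived from another lemma" if is_derived(tags) else "lexicalised"
    print(lemma, tags, kind)
```

Under these assumptions the North Sámi form is analysed as a noun derived from the verb 'kritiseret', while the Finnish form is listed as a noun in its own right, which is the mismatch that point 6 describes.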

Paper

Introduction

  • MT from major to minor language
  • MT between related languages
  • MT between agglutinative closely-related languages: Turkish--{Tatar,Turkmen,...}
  • MT between Finno-Ugric / Sámi languages

Languages

Contrastive analysis of Finnish and North Sámi
  1. Cases
  2. Tenses
  3. Behaviour of NPs (partial agreement in Sámi)

Implementation

Tools
  1. HFST
  2. Constraint Grammar
  3. Apertium
Problematic aspects
  1. Tagset differences (three-way: Omorfi, Fred's CG and Giellatekno)
  2. Underspecification in Omorfi (e.g. cc/cs vs. conj)
  3. Differing grammatical traditions (acc/gen??): merged in Omorfi but not in GT
  4. Overgeneration in sme: the transducers are targeted at _analysis_, not generation
  5. It was sometimes unclear when words were assigned to defective paradigms (e.g. some pronouns?? did not decline in some cases)
  6. Differing lexicalisation ('saamelainen'; 'kritiserema' vs. 'kritisoinnin')
  7. Compound words (e.g. 'sankari' analysed as san#kari)
  8. Agglutination makes typical strategies for testing MT systems (e.g. testvoc) very difficult; rely on a 'corpus test' instead
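
One way to read the 'corpus test' in point 8 is as a naive coverage count over analysed text. The sketch below assumes the analyser output is in the Apertium stream format, where each lexical unit looks like ^surface/analysis$ and an unknown word is marked with a leading asterisk (^word/*word$). The function name naive_coverage and the sample tags are illustrative only, and escaped characters in the stream are ignored.

```python
import re

# One lexical unit: ^surface/analyses$ (escaped characters are not handled here).
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def naive_coverage(stream):
    """Return the fraction of lexical units that have at least one analysis."""
    units = LEXICAL_UNIT.findall(stream)
    if not units:
        return 0.0
    known = sum(1 for _surface, analyses in units
                if not analyses.startswith("*"))
    return known / len(units)


# Illustrative input: one analysed word and one unknown word.
sample = "^sankari/sankari<n><sg><nom>$ ^foo/*foo$"
print(naive_coverage(sample))
```

A count like this only says whether a form received some analysis, not whether the analysis or the translation is correct, so it complements rather than replaces an accuracy evaluation.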

Evaluation

Coverage


Accuracy

Discussion

Future work
Conclusion

References

  • Tantuğ, A. Cüneyd and Adalı, Eşref and Oflazer, Kemal (2007) "A MT System from Turkmen to Turkish employing finite state and statistical methods". In: Machine Translation Summit XI, Copenhagen, Denmark
  • Kemal Altinas (2001) "Turkish to Crimean Tatar Machine Translation System". Master's thesis, Bilkent University.
  • Abulfat Fatullayev and Samir Shagavatov (2008) "Turkish-Azerbaijani Translation Module of Dilmanc MT System". The Second International Conference "Problems of Cybernetics and Informatics", September 10-12, 2008, Baku, Azerbaijan
  • Tyers, F. M. and Wiechetek, L. and Trosterud, T. (2009) "Developing prototypes for machine translation between two Sámi languages". Proceedings of the 13th Annual Conference of the European Association of Machine Translation, EAMT09. pp. 120--128
  • Wiechetek, L. and Tyers, F. M. and Omma, T. (2010) "Shooting at flies in the dark: Rule-based lexical selection for a minority language pair". Lecture Notes in Artificial Intelligence Volume 6233/2010, pp. 418--429