User:Pyry/Sandbox

From Apertium
Challenges in Finnish to North Sámi rule-based machine translation
Translating the Bible from Finnish to North Sámi
Trials and tribulations in Finnish to North Sámi rule-based machine translation

http://www.uoc.edu/freerbmt11/

Submission deadline: Nov 8



(13:38:36) francis: 1) underspecification in omorfi (e.g. cc/cs vs. conj)
(13:39:08) francis: 2) differing grammatical traditions (acc/gen??) merge in omorfi but not in GT
(13:39:28) francis: 3) overgeneration in sme
(13:40:27) francis: 4) sometimes it wasn't clear when words were assigned to defective paradigms (e.g. some pronouns?? didn't decline in some cases)
(13:41:20) francis: btw, we actually had a three way tagset disjunct
(13:41:29) francis: between omorfi, fred's CG and giellatekno
(13:44:30) ryan: one of the larger problems i thought was figuring out what exactly we were trying to do with compound words
(13:45:18) ryan: sankari probably 
(13:45:20) francis: i have one 
(13:45:21) ryan: means hero 
(13:45:26) ryan: but it ended up with a compound analysis 
(13:45:27) ryan: san# kari 
(13:45:29) francis: saamelainen
(13:45:32) ryan: ooh, that too
(13:45:43) ryan: even more related ;) 
(13:46:18) francis: 6) differing lexicalisation

(13:46:25) francis: kritiserema	kritiseret+V+TV+Der3+Der/n+N+Sg+Acc  

vs. kritisoinnin	kritisointi+N+Sg+Gen

Paper

Introduction

  • MT from major to minor language
  • MT between related languages
  • MT between agglutinative closely-related languages: Turkish--{Tatar,Turkmen,...}
  • MT between Finno-Ugric / Sámi languages

Languages

Contrastive analysis of Finnish and North Sámi
  1. Cases
  2. Tenses
  3. Behaviour of NPs (partial agreement in Sámi)

Implementation

Tools
  1. HFST
  2. Constraint Grammar
  3. Apertium
Problematic aspects

Distinguish pair-specific problems from general ones, and look for the general in the specific:

  • Tag-adjustment as a central part of FLOSS grammar projects
    • solution: documentation, supersets??, adjustment?? can this case teach us something?
    • The same for the points below
  • Then general, linguistic topics
  1. Tagset differences (three-way: Omorfi, Fred's CG and Giellatekno)
  2. underspecification in omorfi (e.g. cc/cs vs. conj)
  3. differing grammatical traditions (acc/gen??) merge in omorfi but not in GT
  4. overgeneration in sme -- transducers targeted at _analysis_, not generation.
  5. sometimes it wasn't clear when words were assigned to defective paradigms (e.g. some pronouns?? didn't decline in some cases)
  6. differing lexicalisation ('saamelainen', 'kritiserema' vs. kritisoinnin)
  7. compound words (e.g. 'sankari' = san#kari)
  8. agglutination makes typical strategies for testing MT systems (e.g. testvoc) very difficult; instead we rely on a 'corpus test'. The corpus might not contain all the corner cases, so it is difficult to claim the system is "error free"
  9. lexc makes trimming lexicons challenging (principle: we don't want to analyse what we can't translate)
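
Points 2 and 3 above (underspecified tags, merged acc/gen) could be handled by a small tag-adjustment step between the analyser and the disambiguator. The following is only a sketch: the mapping table and the function name `expand_analysis` are hypothetical, and the CONJ/CC/CS labels are illustrative stand-ins for whatever the real tagsets use. The idea is that one underspecified source tag is expanded into every target tag it might correspond to, leaving the choice to later disambiguation (e.g. Constraint Grammar).

```python
from itertools import product

# Hypothetical mapping table: a source tag that is underspecified or merged
# in one tagset maps to every candidate tag in the other tagset.
TAG_MAP = {
    "CONJ": ["CC", "CS"],   # assumed: generic conjunction vs. coordinating/subordinating
    "Gen": ["Gen", "Acc"],  # assumed: merged genitive/accusative split into two readings
}

def expand_analysis(analysis: str) -> list[str]:
    """Expand one 'lemma+Tag+Tag' analysis into all candidate readings."""
    lemma, *tags = analysis.split("+")
    # Tags without an entry in TAG_MAP are passed through unchanged.
    options = [TAG_MAP.get(tag, [tag]) for tag in tags]
    return ["+".join([lemma, *combo]) for combo in product(*options)]

if __name__ == "__main__":
    for reading in expand_analysis("kritisointi+N+Sg+Gen"):
        print(reading)
```

Expanding rather than picking one tag keeps the adjustment step free of linguistic decisions, at the cost of extra ambiguity for the disambiguator.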

Evaluation

Coverage
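
A rough sketch of how naive coverage could be computed for the corpus test, assuming the analyser output is in the Apertium stream format, where each lexical unit is written `^surface/analysis$` and an unknown word is marked with `*` at the start of its analysis. The function name `naive_coverage` and the example tags are assumptions for illustration, and escaped `^`/`$` characters in the stream are not handled.

```python
import re

# Matches one lexical unit: everything between '^' and the next '$'.
LEXICAL_UNIT = re.compile(r"\^(.*?)\$")

def naive_coverage(stream: str) -> float:
    """Return the fraction of lexical units that received at least one analysis."""
    units = LEXICAL_UNIT.findall(stream)
    if not units:
        return 0.0
    # An unknown word appears as 'surface/*surface' in the stream.
    known = sum(1 for unit in units if "/*" not in unit)
    return known / len(units)

if __name__ == "__main__":
    sample = "^talo/talo<n><sg><nom>$ ^xyzzy/*xyzzy$"
    print(naive_coverage(sample))
```

Note that this only counts whether a token got some analysis, not whether the analysis is correct or translatable.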


Accuracy

Discussion

Advice
  1. Give advice for those building MT systems with Apertium
    1. from existing resources
    2. for agglutinative languages
Future work
  1. Expand the bilingual lexicon rapidly using Algu wordlist (xxx items)
Conclusion
  1. Existing resources can be reused, but will almost certainly require adjustment for the purposes of MT.

References

  • Tantuğ, A. C. and Adalı, E. and Oflazer, K. (2007) "A MT System from Turkmen to Turkish employing finite state and statistical methods". Machine Translation Summit XI, Copenhagen, Denmark
  • Altıntaş, K. (2001) "Turkish to Crimean Tatar Machine Translation System". Master's thesis, Bilkent University
  • Fatullayev, A. and Shagavatov, S. (2008) "Turkish-Azerbaijani Translation Module of Dilmanc MT System". The Second International Conference "Problems of Cybernetics and Informatics", September 10-12, 2008, Baku, Azerbaijan
  • Tyers, F. M. and Wiechetek, L. and Trosterud, T. (2009) "Developing prototypes for machine translation between two Sámi languages". Proceedings of the 13th Annual Conference of the European Association for Machine Translation, EAMT09. pp. 120--128
  • Wiechetek, L. and Tyers, F. M. and Omma, T. (2010) "Shooting at flies in the dark: Rule-based lexical selection for a minority language pair". Lecture Notes in Artificial Intelligence Volume 6233/2010, pp. 418--429