Difference between revisions of "User:Pyry/Sandbox"

Revision as of 17:28, 27 October 2010

Challenges in Finnish to North Sámi rule-based machine translation
Translating the Bible from Finnish to North Sámi
Trials and tribulations in Finnish to North Sámi rule-based machine translation

http://www.uoc.edu/freerbmt11/

Submission deadline: Nov 8



(13:38:36) francis: 1) underspecification in omorfi (e.g. cc/cs vs. conj)
(13:39:08) francis: 2) differing grammatical traditions (acc/gen??) merge in omorfi but not in GT
(13:39:28) francis: 3) overgeneration in sme
(13:40:27) francis: 4) sometimes it wasn't clear when words were assigned to defective paradigms (e.g. some pronouns?? didn't decline in some cases)
(13:41:20) francis: btw, we actually had a three way tagset disjunct
(13:41:29) francis: between omorfi, fred's CG and giellatekno
(13:44:30) ryan: one of the larger problems i thought was figuring out what exactly to do with compound words
(13:45:18) ryan: sankari probably 
(13:45:20) francis: i have one
(13:45:21) ryan: means hero 
(13:45:26) ryan: but it ended up with a compound analysis 
(13:45:27) ryan: san# kari 
(13:45:29) francis: saamelainen
(13:45:32) ryan: ooh, that too
(13:45:43) ryan: even more related ;) 
(13:46:18) francis: 6) differing lexicalisation

(13:46:25) francis: kritiserema	kritiseret+V+TV+Der3+Der/n+N+Sg+Acc  

vs. kritisoinnin	kritisointi+N+Sg+Gen
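
The two analyses above can be compared mechanically. A minimal sketch, assuming analysis strings have the form lemma+Tag+Tag..., that derivation tags start with "Der", and that the last tag is the case tag (true of the two examples, not checked against the full tagsets). The function names are illustrative only.

```python
def parse_analysis(analysis):
    """Split an analysis string 'lemma+Tag1+Tag2' into (lemma, [tags])."""
    lemma, *tags = analysis.split("+")
    return lemma, tags


def describe(analysis):
    """Summarise an analysis: lemma, tags, whether it is derived, final tag."""
    lemma, tags = parse_analysis(analysis)
    # Assumption: derivation tags start with "Der", as in the Sámi example.
    derived = any(tag.startswith("Der") for tag in tags)
    # Assumption: the last tag is the case tag, as in both examples.
    case = tags[-1] if tags else None
    return {"lemma": lemma, "tags": tags, "derived": derived, "case": case}


sme = describe("kritiseret+V+TV+Der3+Der/n+N+Sg+Acc")
fin = describe("kritisointi+N+Sg+Gen")

if __name__ == "__main__":
    for side in (sme, fin):
        print(side["lemma"], side["derived"], side["case"])
```

On these two strings the Sámi side comes out as a noun derived from the verb kritiseret, while the Finnish side is a noun lemma in its own right, which is the lexicalisation mismatch in point 6.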

Paper

Introduction

  • MT ((S/CB)MT and RBMT)
  • MT from major to minor language
  • MT between related languages
  • MT between agglutinative languages: Turkish--{Tatar,Turkmen,...}
  • MT between Finno-Ugric / Sámi languages

Languages

Contrastive analysis of Finnish and North Sámi
  1. Cases
  2. Tenses
  3. Behaviour of NPs (partial agreement in Sámi)

Implementation

Tools
  1. HFST
  2. Constraint Grammar
  3. Apertium
Problematic aspects

Distinguish pair-specific problems from general ones, and look for the general in the specific:

  • Tag-adjustment as a central part of FLOSS grammar projects
    • solution: documentation, supersets??, adjustment?? Can this case teach us something?
    • The same for the points below
  • Then general, linguistic topics
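
One way to picture tag adjustment is a table that maps each source tagset onto a shared one. A minimal sketch: the tag names, the shared inventory and the assignment of the underspecified conjunction tag to omorfi are all illustrative assumptions, not the real omorfi or Giellatekno inventories.

```python
# Hypothetical mapping tables. Each source tag maps to the SET of shared tags
# it may correspond to; a set with more than one member marks underspecification.
TAG_MAP = {
    "omorfi": {"CONJ": {"cc", "cs"}, "N": {"n"}, "V": {"v"}},
    "giellatekno": {"CC": {"cc"}, "CS": {"cs"}, "N": {"n"}, "V": {"v"}},
}


def harmonise(source, tag):
    """Return the set of shared tags that a source tag may correspond to."""
    try:
        return TAG_MAP[source][tag]
    except KeyError:
        raise ValueError(f"no mapping for {source} tag {tag!r}") from None


def is_underspecified(source, tag):
    """True if the source tag does not pick out a single shared tag."""
    return len(harmonise(source, tag)) > 1


if __name__ == "__main__":
    print(harmonise("omorfi", "CONJ"), is_underspecified("omorfi", "CONJ"))
    print(harmonise("giellatekno", "CC"), is_underspecified("giellatekno", "CC"))
```

Raising an error on an unmapped tag, rather than passing it through, makes gaps in the documentation of either tagset show up early.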


  1. Issues related to the merging of transducers
    1. Tagset differences (three-way: Omorfi, Fred's CG and Giellatekno)
    2. underspecification in omorfi (e.g. cc/cs vs. conj)
    3. overgeneration in sme -- transducers targeted at _analysis_, not generation.
    4. sometimes it wasn't clear when words were assigned to defective paradigms (e.g. some pronouns?? didn't decline in some cases)
  2. lexc makes trimming lexicons challenging. (principle: we don't want to analyse what we can't translate)
  3. Issues related to grammatical traditions
    1. differing grammatical traditions (acc/gen??) merge in omorfi but not in GT
  4. Issues related to linguistic differences/characteristics
    1. differing lexicalisation ('saamelainen', 'kritiserema' vs. kritisoinnin)
  5. agglutination makes typical strategies for testing MT systems (e.g. testvoc) very difficult, so we rely on a 'corpus test' instead. The corpus might not contain all the corner cases, so it is difficult to claim "error free"
  6. Then what-is-this:
    1. compound words (e.g. 'sankari' = san#kari)
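
The trimming principle in point 2 (do not analyse what we cannot translate) can be stated at the level of lemma lists. A minimal sketch on toy data: it only illustrates the principle, it does not operate on lexc sources or compiled transducers, and it ignores compounds and derivations, which is where the real difficulty lies.

```python
def trim_lexicon(monolingual_lemmas, bilingual_source_lemmas):
    """Split lemmas into those the bilingual dictionary covers and the rest."""
    translatable = set(bilingual_source_lemmas)
    kept = [lemma for lemma in monolingual_lemmas if lemma in translatable]
    dropped = [lemma for lemma in monolingual_lemmas if lemma not in translatable]
    return kept, dropped


# Toy data: the lemmas come from the notes above, but which of them are in the
# bilingual dictionary is made up for the example.
monolingual = ["sankari", "saamelainen", "kritisointi"]
bilingual = {"sankari", "saamelainen"}

kept, dropped = trim_lexicon(monolingual, bilingual)

if __name__ == "__main__":
    print("kept:", kept)
    print("dropped:", dropped)
```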

Evaluation

Statistics
  1. Number of bilingual dictionary entries
  2. Number of disambiguation rules
  3. Number of transfer rules
Coverage
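
Coverage here could be measured as naive coverage: the share of corpus tokens that receive at least one analysis. A minimal sketch, assuming a function analyse(token) that returns a possibly empty list of analyses. The dictionary below is a hypothetical stand-in for a transducer lookup, and the last token is an invented unknown word.

```python
def naive_coverage(tokens, analyse):
    """Fraction of tokens for which analyse(token) returns at least one analysis."""
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if analyse(token))
    return known / len(tokens)


# Hypothetical stand-in for the morphological analyser.
TOY_ANALYSES = {
    "sankari": ["sankari+N+Sg+Nom"],
    "kritisoinnin": ["kritisointi+N+Sg+Gen"],
    "saamelainen": ["saamelainen+N+Sg+Nom"],
}


def toy_analyse(token):
    """Look a token up in the toy table; unknown tokens get no analyses."""
    return TOY_ANALYSES.get(token, [])


corpus = ["sankari", "saamelainen", "kritisoinnin", "xyzzy"]
coverage = naive_coverage(corpus, toy_analyse)

if __name__ == "__main__":
    print(f"naive coverage: {coverage:.2%}")
```

Naive coverage says nothing about whether the analyses are correct, so it complements the accuracy figures rather than replacing them.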


Accuracy

Discussion

Advice
  1. Give advice for those building MT systems with Apertium
    1. from existing resources
    2. for agglutinative languages
Future work
  1. Expand the bilingual lexicon rapidly using Algu wordlist (xxx items)
  2. MT between other Finno-Ugric languages (fin-est, sme-sma, ...)
Conclusion
  1. Existing resources can be reused, but will almost certainly require adjustment for the purposes of MT.

References

  • Tantuğ, A. Cüneyd and Adalı, Eşref and Oflazer, Kemal (2007) A MT System from Turkmen to Turkish employing finite state and statistical methods. In: Machine Translation Summit XI, Copenhagen, Denmark
  • Kemal Altinas (2001) "TURKISH to CRIMEAN TATAR MACHINE TRANSLATION SYSTEM". Masters Thesis, Bilkent University.
  • Abulfat Fatullayev and Samir Shagavatov (2008) "TURKISH-AZERBAIJANI TRANSLATION MODULE OF DILMANC MT SYSTEM". The Second International Conference “Problems of Cybernetics and Informatics” September 10-12, 2008, Baku, Azerbaijan
  • Tyers, F. M. and Wiechetek, L. and Trosterud, T. (2009) "Developing prototypes for machine translation between two Sámi languages". Proceedings of the 13th Annual Conference of the European Association of Machine Translation, EAMT09. pp. 120--128
  • Wiechetek, L. and Tyers, F. M. and Omma, T. (2010) "Shooting at flies in the dark: Rule-based lexical selection for a minority language pair". Lecture Notes in Artificial Intelligence Volume 6233/2010, pp. 418--429
  • Tae Wan Kim, Jin Tae Lee, Chang Ho Park, Ki Sik Lee (1986) "MACHINE TRANSLATION OF THE URAL-ALTAIC AS AN AGGLUTINATIVE LANGUAGE". Proceedings of IAI-MT86, 20-22 August 1986
  • Muhtar MAHSUT, Yasuhiro OGAWA, Kazue SUGINO, Yasuyoshi INAGAKI () "Utilizing Agglutinative Features in Japanese-Uighur Machine Translation". MT Summit VIII.