User:Ilienert/GSocApplication2010

== Contact Information ==

'''Name:''' Ian Lienert<br />
'''E-mail Address:''' ian.lienert@gmail.com<br />
'''Phone:''' 1-647-885-0840<br />
'''Skype/IRC/SourceForge Nick:''' ilienert<br />


== Why Machine Translation? ==
I have always been fascinated by the most intractable problems in computer science. This is the fundamental reason I decided to become a part of the field. Amongst all of the "difficult" problems out there, I can think of none more daunting than mastering natural language processing. The domains of NLP problems are loosely defined and not fully understood even by those who study them. Further adding to the chaos is the sheer number of language pairs in existence. To tackle this kind of problem, collaboration is absolutely necessary and the level of thinking must be as abstract as possible, lest we be faced with an unrealistically large corpus of low-level translation rules and an unintelligible code base. This, then, is what drives me to be a part of NLP and the formalizing of the informal.

== Why Apertium? ==
As implied above, RBMT interests me the most, because it relies on the rules of language itself rather than simply on what has worked well in the past, as strictly statistical (SMT) systems do. I have always sought out projects that will inform and educate me, and by helping people translate language I will gain a better understanding of how language works. The particular aspect I wish to work on, modifying the PoS tagger to detect unknown words, deals with the part of NLP that most challenges RBMT: ambiguity. Apertium uses statistical methods to attempt to disambiguate words. Though this can be seen as a statistical element in an otherwise rule-based system, it is likely a necessary one, since humans too must infer context through learning.

== My Intended Task ==
I plan to modify the PoS tagger to [[Ideas_for_Google_Summer_of_Code/Detect_hidden_unknown_words|detect hidden unknown words]]. Specifically, I will assign open-class tags to words in the train() function, and I will modify the Viterbi algorithm to compare the maximum transition probability of a surface form with the emission probability of
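
As a rough illustration of the comparison described above, the sketch below contrasts the score a surface form reaches through the tags it is known to have with the score it would reach if treated as an open-class unknown word, flagging the form when the unknown hypothesis wins by a clear margin. This is only a minimal sketch, not Apertium's actual tagger code: the tag names, probabilities, threshold and helper functions are all hypothetical.

<pre>
// Minimal sketch (hypothetical data, not Apertium code): decide whether a
// surface form's known analyses look suspicious by comparing them with an
// open-class unknown-word hypothesis in a single Viterbi-style step.
#include <algorithm>
#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Tag = std::string;

// P(tag | previous tag): hypothetical bigram transition probabilities.
std::map<std::pair<Tag, Tag>, double> transition = {
    {{"det", "noun"}, 0.60}, {{"det", "adj"}, 0.30}, {{"det", "verb"}, 0.05},
};

// P(word | tag): hypothetical emission probabilities for known analyses.
std::map<std::pair<std::string, Tag>, double> emission = {
    {{"book", "noun"}, 0.0020}, {{"book", "verb"}, 0.0005},
};

// Open-class tags an unknown word could take, with a smoothed emission
// probability for unseen surface forms (again, a made-up constant).
const std::vector<Tag> open_class = {"noun", "verb", "adj"};
const double unk_emission = 0.001;

// Best score the word can reach using only the tags it is known to have.
double best_known_score(const std::string& word, const Tag& prev,
                        const std::vector<Tag>& known_tags) {
  double best = 0.0;
  for (const Tag& t : known_tags)
    best = std::max(best, transition[{prev, t}] * emission[{word, t}]);
  return best;
}

// Best score the word would reach if treated as an open-class unknown.
double best_unknown_score(const Tag& prev) {
  double best = 0.0;
  for (const Tag& t : open_class)
    best = std::max(best, transition[{prev, t}] * unk_emission);
  return best;
}

int main() {
  // "book" after a determiner, listed in the dictionary as noun or verb.
  const std::string word = "book";
  const Tag prev = "det";
  const std::vector<Tag> known = {"noun", "verb"};

  const double known_score = best_known_score(word, prev, known);
  const double unknown_score = best_unknown_score(prev);
  const double threshold = 2.0;  // hypothetical ratio for flagging

  std::cout << "known analyses score:  " << known_score << '\n'
            << "unknown hypothesis:    " << unknown_score << '\n';
  if (unknown_score > threshold * known_score)
    std::cout << word << " may be a hidden unknown word\n";
  else
    std::cout << word << " seems adequately covered by its analyses\n";
}
</pre>

In the actual tagger the probabilities would come from the model estimated during training rather than from hard-coded tables, and the comparison would take place inside the existing Viterbi computation; the sketch only shows the shape of the decision.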
