Difference between revisions of "User:Chenxiajian/GSoCProposal"
Revision as of 15:52, 7 April 2011
Detect 'hidden' unknown words in Apertium
Email: xmujay@gmail.com
Short description:
This proposal describes a two-step strategy for effectively detecting 'hidden' unknown words. The first step uses context-independent phoneme HMMs to recognize registered words and phoneme-cluster HMMs to detect unknown words. The second step uses context-dependent phoneme models for precise recognition, during which unknown words are transcribed.
Name: Chen Xiajian
Education: PhD, Chinese Academy of Sciences
Email: xmujay@gmail.com
Project Title:
Detect 'hidden' unknown words in Apertium
Short description:
Apertium's part-of-speech tagger can be modified to compute the likelihood of each tag in a given context, and this likelihood can be used to detect entries missing from the dictionary. This proposal describes a two-step strategy for effectively detecting such 'hidden' unknown words.
In the first step, context-independent phoneme HMMs recognize registered words and phoneme-cluster HMMs detect unknown words.
In the second step, context-dependent phoneme models are used for precise recognition, during which unknown words are transcribed.
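The tagger-based idea mentioned in the short description can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy transition probabilities, the `FLOOR` value, the `ratio` threshold and the function names are hypothetical and are not taken from the Apertium tagger. The sketch scores each tag at a word position by the transitions from the previous tag and to the next tag, and flags the word when a tag the dictionary does not list for it is far more likely in that context than any listed tag.

```python
# Toy tag-transition probabilities P(next_tag | prev_tag).
# The tag names and every number here are made up for illustration.
TRANS = {
    ("det", "n"): 0.6,
    ("det", "adj"): 0.3,
    ("det", "vblex"): 0.01,
    ("n", "vblex"): 0.5,
    ("adj", "vblex"): 0.05,
    ("vblex", "vblex"): 0.02,
}
TAGS = ["det", "n", "adj", "vblex"]
FLOOR = 1e-4  # probability assumed for transitions missing from TRANS


def context_likelihoods(prev_tag, next_tag):
    """Return a normalized likelihood for each tag between prev_tag and next_tag."""
    scores = {
        tag: TRANS.get((prev_tag, tag), FLOOR) * TRANS.get((tag, next_tag), FLOOR)
        for tag in TAGS
    }
    total = sum(scores.values())
    return {tag: score / total for tag, score in scores.items()}


def is_hidden_unknown(dict_tags, prev_tag, next_tag, ratio=5.0):
    """Flag a word whose listed tags (non-empty set) fit the context poorly.

    Returns (flag, best_tag), where best_tag is the most likely tag in context.
    """
    probs = context_likelihoods(prev_tag, next_tag)
    best_listed = max(probs[tag] for tag in dict_tags)
    best_tag = max(probs, key=probs.get)
    flag = best_tag not in dict_tags and probs[best_tag] > ratio * best_listed
    return flag, best_tag


# A word listed only as a verb, seen between a determiner and a verb.
print(is_hidden_unknown({"vblex"}, "det", "vblex"))
```

With these toy numbers, the word is flagged and `n` is reported as the likely missing analysis. If the dictionary already lists `n` for the word, nothing is flagged.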
Benefits to Apertium:
Detect 'hidden' unknown words in Apertium
Deliverables:
We plan to achieve this goal:
In sentence recognition experiments using this unknown-word processing, phoneme cluster models that consider the Chinese syllabic construction achieved a higher word accuracy rate of 70.3%, compared with 59.2% for sentence recognition without this processing. Furthermore, the amount of processing was reduced by about half compared with a detection method using phoneme HMMs. The total system achieves a 75.2% phoneme accuracy rate including the transcription of unknown words.
Project Details:
Table of Contents:
1 Two step algorithm to Detect 'hidden' unknown words
2 Unknown word processing
3 Project Schedule
4 Award in the international OW2 Open Source Competition (Third Prize, Top 10)
5 My resume
1. Two step algorithm to Detect 'hidden' unknown words
To detect unknown words effectively while keeping the additional detection cost small, my approach uses phoneme-cluster hidden Markov models (HMMs) and a part-of-speech (POS)-dependent language model.
The cluster model can be regarded as a kind of garbage model, as sometimes used in keyword spotting. This method increases processing time only slightly, because only a small number of additional acoustic models are needed to cover all unknown words.
We propose two types of phoneme clusters, one based on linguistic construction and the other on acoustic similarity, together with the following two-step recognition strategy, which detects unknown words and then transcribes them.
* First step: Detect unknown-word sections using context-independent phoneme-cluster HMMs, and recognize registered words using context-independent phoneme HMMs. A context-free grammar (CFG) is used to achieve higher performance in recognizing registered words and in generating unknown-word candidates. A POS-dependent phoneme-cluster n-gram model is also used.
* Second step: Transcribe the detected unknown-word sections as phoneme sequences, and re-recognize the registered words using more accurate models such as context-dependent phoneme HMMs. In this step, the N-best candidates from the first step are re-recognized and a POS-dependent phoneme n-gram is used.
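The control flow of the two steps can be sketched as follows. This is not an HMM recognizer: string similarity stands in for acoustic scoring, a fixed `GARBAGE_SCORE` stands in for the cluster (garbage) model, and the input is assumed to be already segmented. The lexicon entries, the threshold and all function names are hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical registered words with their phoneme strings.
LEXICON = {
    "tokyo": "t o k y o",
    "kaigi": "k a i g i",
    "toroku": "t o r o k u",
}
GARBAGE_SCORE = 0.75  # stand-in for the score of the cluster (garbage) model


def coarse_score(seg, ref):
    """Cheap first-step score: similarity of two phoneme sequences."""
    return SequenceMatcher(None, seg.split(), ref.split()).ratio()


def fine_score(seg, ref):
    """Stricter second-step score: share of positions that agree exactly."""
    a, b = seg.split(), ref.split()
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))


def first_step(segments, n_best=2):
    """Keep N-best word candidates and mark segments the garbage model wins."""
    results = []
    for seg in segments:
        ranked = sorted(
            LEXICON, key=lambda w: coarse_score(seg, LEXICON[w]), reverse=True
        )[:n_best]
        best = coarse_score(seg, LEXICON[ranked[0]])
        results.append((seg, ranked, best < GARBAGE_SCORE))
    return results


def second_step(first_pass):
    """Transcribe unknown sections; re-score N-best candidates for the rest."""
    output = []
    for seg, candidates, unknown in first_pass:
        if unknown:
            output.append("<unk:" + seg.replace(" ", "") + ">")
        else:
            output.append(max(candidates, key=lambda w: fine_score(seg, LEXICON[w])))
    return output


print(second_step(first_step(["t o k y o", "s a p o r o", "k a i g i"])))
```

With this toy lexicon the middle segment matches no registered word well enough, so it is returned as the phoneme transcription `<unk:saporo>`, while the other two segments are recognized as `tokyo` and `kaigi`.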
2. Unknown word processing
2.1. Candidate generation
Because applying syntactic grammatical constraints is one of the most effective methods for achieving higher performance, a CFG is used in this system. It is necessary to integrate rules that enable the generation of unknown words into the grammar.
We assume that frequently occurring words have already been registered in the lexicon and that infrequent words have to be processed as unknown words. Accordingly, we investigated the occurrence frequency of all words in the text data of a "conference registration" task. We found that 44.9% of all distinct words occurred only once, and that nouns accounted for 65.2% of these words. Based on this result, we decided to use an initial CFG in which the unknown-word generation rules are included in the noun category. This generation method is equally applicable to other POSs. For comparison purposes, the total number of mixtures was set equal across the three kinds of cluster models described in Section 2.2.
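The frequency analysis described above can be reproduced on any POS-tagged corpus with a few lines of code. The sketch below is a minimal illustration: the input format (a list of `(word, pos)` pairs), the tag name `"n"` for nouns and the function name are assumptions, and the toy data is not the "conference registration" corpus.

```python
from collections import Counter


def singleton_stats(tagged_tokens, noun_tag="n"):
    """Return (share of distinct words seen once, share of nouns among them)."""
    counts = Counter(tagged_tokens)  # frequency of each distinct (word, pos)
    singletons = [entry for entry, freq in counts.items() if freq == 1]
    share_singletons = len(singletons) / len(counts)
    noun_count = sum(1 for _, pos in singletons if pos == noun_tag)
    share_nouns = noun_count / len(singletons) if singletons else 0.0
    return share_singletons, share_nouns


# Toy corpus, for illustration only.
corpus = [
    ("the", "det"), ("the", "det"),
    ("conference", "n"), ("fee", "n"),
    ("is", "vblex"), ("is", "vblex"), ("paid", "vblex"),
]
print(singleton_stats(corpus))
```

On real data, the two returned values correspond to the 44.9% and 65.2% figures quoted above.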
2.2. Acoustic models for unknown words
For unknown-word detection, three kinds of cluster sets and one kind of phoneme model were compared. Of these, Type-2 and Type-3 are the proposed cluster models.
* Single cluster (garbage) model (Type-1)
All phonemes were unified into a single cluster model (4-state, 45-mixture). This extreme cluster model adds little processing, but it cannot use a language model for unknown words.
* Cluster models based on linguistic construction (Type-2)
One cluster model (4-state, 5-mixture) unified from 18 consonants and eight phoneme models (4-state, 5-mixture) including all vowels were used. This minimal division was additionally designed to take into account the Japanese syllabic construction.
* Cluster models based on acoustic similarity (Type-3)
Nine cluster models (4-state, 5-mixture) were generated by automatically splitting the single cluster model of Type-1 into cluster classes using a clustering method (successive state splitting, SSS).
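The difference between the Type-1 and Type-2 cluster sets can be shown as a mapping from phoneme labels to cluster labels. The sketch below is an illustration under stated assumptions: the five-vowel inventory and the labels `"G"` and `"C"` are hypothetical, and the eight non-consonant models mentioned above are reduced here to the vowels only. Type-3 is not sketched, because its classes come from acoustic clustering (SSS) rather than from a fixed rule.

```python
# Hypothetical vowel inventory; the real system keeps eight separate models.
VOWELS = {"a", "i", "u", "e", "o"}


def to_type1(phonemes):
    """Type-1: every phoneme falls into the single garbage cluster 'G'."""
    return ["G"] * len(phonemes)


def to_type2(phonemes):
    """Type-2: consonants share one cluster 'C'; vowels keep their own model."""
    return [p if p in VOWELS else "C" for p in phonemes]


word = ["s", "a", "p", "o", "r", "o"]
print(to_type1(word))
print(to_type2(word))
```

The Type-2 output keeps the consonant-vowel alternation visible, which is what allows a language model over cluster symbols to be applied, whereas the Type-1 output carries no such information.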
2.3. Language models for unknown words
To detect unknown words effectively in the first pass, a POS-dependent cluster or phoneme n-gram was used for Type-2 and Type-3.
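A POS-dependent n-gram over cluster symbols can be sketched as a bigram model with one count table per POS. This is a minimal illustration and not the model used in the experiments: the n-gram order (2), the add-one smoothing, the class name and the training sequences are all assumptions.

```python
import math
from collections import Counter, defaultdict


class PosBigram:
    """Bigram model over cluster symbols, conditioned on the POS of the word."""

    def __init__(self, symbols):
        # "</s>" marks the end of a sequence and can be predicted like a symbol.
        self.vocab = list(symbols) + ["</s>"]
        self.counts = defaultdict(Counter)  # (pos, previous symbol) -> next counts

    def train(self, pos, sequence):
        prev = "<s>"
        for sym in list(sequence) + ["</s>"]:
            self.counts[(pos, prev)][sym] += 1
            prev = sym

    def log_prob(self, pos, sequence):
        """Add-one smoothed log-probability of a sequence for the given POS."""
        prev, total = "<s>", 0.0
        for sym in list(sequence) + ["</s>"]:
            row = self.counts.get((pos, prev), Counter())
            total += math.log((row[sym] + 1) / (sum(row.values()) + len(self.vocab)))
            prev = sym
        return total


model = PosBigram(["C", "a", "o"])
model.train("n", ["C", "a", "C", "o"])
model.train("n", ["C", "a", "C", "a"])
print(model.log_prob("n", ["C", "a"]), model.log_prob("n", ["a", "C"]))
```

After training on noun-like sequences, a consonant-vowel sequence scores higher for the noun POS than a vowel-consonant one, which is the kind of constraint the first pass can use when generating unknown-word candidates.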
Project Schedule:
{| class="wikitable"
! Task !! Time Needed !! Description
|-
| Learning time
| Two weeks or less
| Learn more about the algorithm. <br /> Read the source code of the preview module in the instruction of Apertium in depth. <br /> Design the data structure.
|-
| Implementation time
| One month
| Implement the algorithm with the guidance of the mentor.
|-
| Debug time
| Two weeks
| Test in public (if possible) or in private.
|-
| Doc time
| Two weeks or less
| Write the documentation required by GSoC. <br /> If time permits, document the algorithm.
|-
| Refactor time
| Two weeks, or throughout the whole GSoC period
| Refactor the preview module with the mentor.
|}
Some of my Awards & Certificates (OW2 Open Source Competition)
The following are some of my awards and certificates:
1 Award in the international OW2 Open Source Competition (Third Prize, Top 10)
2 Award in the VMware Cloud Computing Creative Competition (Excellent Works, Top 10)
My Resume
Xiajian Chen
Tel: 086-15811006153 Email: xmujay@gmail.com
▲Education
2008.09~Now Institute of Software, Chinese Academy of Sciences, Master, Rank: 2/108 (2%)
2004.09~2008.07 Computer Science and Technology, Xiamen University, Bachelor, Rank: 3/250 (1%)
▲Scholarships & Honors
2010.06 Excellent Works award, VMware Chinese Clouds Computing Creative Competition (VMware, top 50);
2010.05 Third prize, Baidu Star Programming Contest, semi-finals (Baidu, top 400);
2010.03 National Excellent Works award, Tencent Internet Development Competition (Tencent, 1%);
2009.08 Certification of International Junior Achievement Organization "Career Go!" & Excellent Students (JA, 1%);
2008.10 Third prize, National College Challenge Cup final (Communist Youth League, Ministry of Education, 1%);
2005~2009 Excellent Student award and national scholarship in five consecutive years (XMU, ISCAS, 2%);
2006.07 First prize, ACM Programming Contest at Xiamen University; first prize in the National Chemistry Olympiad
▲Intern Experience
2009.03~2009.09 Nokia Research Institute of China
Research and implementation of sketch animation and skeletal animation on Nokia smartphones (N800 & N900)
2006.07~2008.07 Xiamen LongTop Software Co., Ltd.
Defect Management System for Xiamen LongTop Software Co., Ltd.
▲Project Experience
2008.08~2009.05 Distributed Resources Management System Based on Hadoop
Description: Provides a platform for sharing resources. Handles large-scale log data analysis and recommendation using the Hadoop distributed framework and the Map-Reduce algorithm.
* Designed the software framework using Spring, Struts and Hibernate (J2EE platform);
* Implemented the distributed framework with Hadoop, using Map-Reduce algorithms for resource recommendation;
* Designed the visual tag and recommendation model using Ajax, JSON and jQuery to improve the user experience;
* Used Lucene to index data to speed up queries, and ICTCLAS to improve query accuracy;
Tools: Hadoop Map-Reduce, Spring, Struts, Hibernate, Ajax, JSON, jQuery, Lucene
2008.11~2009.03 Nokia Institute: 2D Sketch Animation Tool Based on Nokia Smartphone
Description: A sketch animation tool for mobile devices that supports modeling with freehand drawing. Users can create animation through motion paths, deformation, and synchronization of basic actions. The software was accepted by the Nokia institute in China and demonstrated at the Nokia International experience conference.
* Designed a platform-independent software architecture and implemented functions including construction of the sketch model's tree structure, track-based and deformation animation, and animation synchronization;
* Solved two crucial technical problems while porting to the Nokia N900: double buffering and transparent windows;
* Used the MVC framework and design patterns such as Command, State, Strategy, Factory and Observer to make the code more readable and to improve the maintainability and scalability of the system;
* Implemented response automation in the interactive interface, which improves the stability of the software.
2009.04~2009.12 Performance-driven Skeletal Animation for Mobile Devices
Description: Aims to lower the barrier to character animation authoring by combining the character's pose and motion, and introduces motion retargeting technology to further simplify the animation process.
* Represented the model as a tree structure using the AP method, allowing characters with arbitrary topology to be drawn;
* Combined forward kinematics with inverse kinematics to produce the intermediate process of posture movement, achieving a low-complexity algorithm that increases the rate by 25%; ported to the Nokia N900;
* Used space frame interpolation to generate posture movement and to perform two motions simultaneously.
2009.10~Now National Key 863 Project: Graphical Programming for Children
Description: Provides visual blocks that represent code, to help children master programming skills.
* Studied the requirements, and designed and developed eight types of primary code blocks, including loop and condition blocks;
* Completed code block splicing, grammar detection, and execution of the program modules;
* Plan to add tangible programming, using a camera to collect code and RFID to enhance flexibility.
▲English & Technology
English: CET-4 passed in freshman year, CET-6 passed in sophomore year, with good reading and writing skills.
Computer:
♦ Strong skills in C++, VC (MFC) and STL (5 years of project development experience);
♦ Strong skills in OOP and design patterns; familiar with data structures and algorithms;
♦ Strong skills in STL and template generic programming, having studied the STL source; familiar with Linux shell and Oracle;
♦ Familiar with the Hadoop distributed framework and Map-Reduce algorithms; familiar with Spring, Struts and Hibernate.
▲Social Practice
2008.10~2009.10 Chairman of the Microsoft Innovation Association at the Chinese Academy of Sciences
2008.10 Participated in the Xiamen International Marathon; finished the 42.195 km course (No. 421)
Have you developed any software simply because it was fun? What was it?
Yes: a Distributed Resources Management System based on Hadoop.
Detail: Provides a platform for sharing resources. Handles large-scale log data analysis and recommendation using the Hadoop distributed framework and the Map-Reduce algorithm.
* Designed the software framework using Spring, Struts and Hibernate (J2EE platform);
* Implemented the distributed framework with Hadoop, using Map-Reduce algorithms for resource recommendation;
* Designed the visual tag and recommendation model using Ajax, JSON and jQuery to improve the user experience;
* Used Lucene to index data to speed up queries, and ICTCLAS to improve query accuracy.