Difference between revisions of "User:Pankajksharma/Application"

Revision as of 19:00, 18 March 2014

Personal Information

Name: Pankaj Kumar Sharma

E-mail address: sharmapankaj1992@gmail.com

Other information that may be useful to contact:

My alternative email: pankaj@pankajksharma.com

Interest in MT and Apertium

Why is it you are interested in machine translation?

I am interested in Machine Translation (MT) for two reasons. The first is somewhat philosophical: the ideal of making all digital information openly available to everyone, regardless of the language it is written in or the language used by its readers. MT would also help lower the language barrier in the exchange of ideas.

Second, I did my minor in Text Classification, and since then I have been interested in Machine Learning, which brought me closer to NLP (of which MT is a part). To be honest, until recently I had used MT only as an end user.

Why is it that you are interested in the Apertium project?

I am interested in Apertium because:

It's open source.
The community is very helpful (I experienced this in my interactions during the project discussion).
All the techniques used in Apertium are described in research papers (so anyone can learn from them).
Apertium works offline as well (:P).

Proposal

Title

Fuzzy-match repair from Translation Memory

Abstract

For a given sentence S in a source language and its translation T in another language, the idea is to find the translation of another sentence S'. The condition is that S and S' must have a high fuzzy-match score (or low edit distance) between them. Depending on what changes from S to S', we apply a set of repair operations (t, t') to T to get T'. Here T' is a possible translation of S', and each pair (t, t') satisfies the condition that t is a sub-segment of T and t' is the replacement for t that leads us to T'.

Another phase of the project is to preprocess an existing translation memory for the source and target languages and store validated (s, t) pairs (s is a sub-segment of S, t is a sub-segment of T, and s translates to t). These pairs could be used to generate better, verified (s', t') pairs.

This idea was originally given by User:mlforcada.

Project Details

These details were developed after discussions with User:mlforcada and may vary slightly during the implementation phase.

We will use the following example throughout this section:

 S    "he changed his address recently"

 T    "Va canviar la seva adreça recentment"

 S'   "he changed his number recently"

 LP   "en-ca" (English-Catalan)

Finding fuzzy match score

To find out whether the given source sentences (S and S') are similar to each other, we'll use the fuzzy match score of S and S'.

We would use the following formula to find the fuzzy match score (FMS) between S and S':

FMS(S, S') = 1 - ED(S, S') / max(|S|, |S'|)

ED(S, S') is the edit distance between S and S', and |S| is the number of words in S. We would use the word-level Levenshtein distance between the sentences to calculate the edit distance.

The program will proceed only if FMS >= min-fms (specified by the user, default 80%).

In our example:

ED(S,S') = 1 (since only "address" and "number" differ).

max(|S|, |S'|) = 5

Hence, FMS(S,S') = 0.80 (or 80%).

Since the FMS is large enough, we'll proceed further.

Please check the coding challenge to find out in detail how ED is calculated.
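The calculation above can be sketched in a few lines of Python. This is only a minimal illustration, not the coding challenge code: the function names `edit_distance` and `fms` are made up here, and sentences are assumed to be tokenised by splitting on whitespace.

```python
def edit_distance(a, b):
    """Word-level Levenshtein distance between two lists of tokens."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, word_a in enumerate(a, start=1):
        curr = [i]
        for j, word_b in enumerate(b, start=1):
            cost = 0 if word_a == word_b else 1
            curr.append(min(prev[j] + 1,          # delete word_a
                            curr[j - 1] + 1,      # insert word_b
                            prev[j - 1] + cost))  # keep or substitute
        prev = curr
    return prev[-1]


def fms(s, s_new):
    """FMS(S, S') = 1 - ED(S, S') / max(|S|, |S'|), lengths counted in words."""
    a, b = s.split(), s_new.split()  # assumption: whitespace tokenisation
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty sentences are treated as identical
    return 1 - edit_distance(a, b) / longest


score = fms("he changed his address recently", "he changed his number recently")
print(round(score, 2))
```

For the running example this gives an edit distance of 1 over 5 words, which is the 80% computed by hand above.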

Finding what changed from S to S'

To find out the changes between S and S', we would employ the phrase-extraction algorithm, with a slight modification, to obtain pairs (s, s') where s and s' are sub-segments of S and S' respectively and there is some non-alignment between them. We'd call the covering set A.

The modification is to consider only those pairs which have one or more mismatches (or non-alignments) and satisfy the following condition:

min-len <= |s|, |s'| <= max-len (min-len and max-len being specified by the user).

In our example:

Pairs [(1, 1), (2, 2), (3, 3), (5, 5)] are the same (or aligned) in S and S',

i.e., their longest common subsequence contains the words present at index i of S and index j of S' for each pair (i, j).

Though the default phrase-extraction algorithm [implemented in the coding challenge] will give more pairs, we'll consider only those pairs which satisfy the conditions given above. Say min-len=2 and max-len=3; then our set A will be:

[("changed his", "changed his number"),

("changed his address", "changed his"),

("changed his address", "changed his number"),

("his address", "his number"),

("his address recently", "his number recently"),

("address recently", "recently"),

("address recently", "number recently"), ...]
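A rough sketch of how set A could be produced is given below. It is an illustration under assumptions, not the coding challenge implementation: `difflib.SequenceMatcher` stands in for the longest-common-subsequence alignment, the names `align`, `spans` and `mismatch_pairs` are hypothetical, and the length bounds are applied strictly to both sides, so a one-word side is not produced when min-len=2.

```python
from difflib import SequenceMatcher


def align(src, tgt):
    """Set of (i, j) index pairs (0-based) for words matched in both sentences."""
    links = set()
    matcher = SequenceMatcher(None, src, tgt, autojunk=False)
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            links.add((block.a + k, block.b + k))
    return links


def spans(n, min_len, max_len):
    """All half-open (start, end) spans with min_len <= length <= max_len."""
    return [(i, i + length)
            for length in range(min_len, max_len + 1)
            for i in range(n - length + 1)]


def mismatch_pairs(s, s_new, min_len=2, max_len=3):
    src, tgt = s.split(), s_new.split()
    links = align(src, tgt)
    aligned_src = {i for i, _ in links}
    aligned_tgt = {j for _, j in links}
    pairs = []
    for a0, a1 in spans(len(src), min_len, max_len):
        for b0, b1 in spans(len(tgt), min_len, max_len):
            # Consistency: every link touching one span must lie inside both.
            touching = [(i, j) for i, j in links
                        if a0 <= i < a1 or b0 <= j < b1]
            consistent = all(a0 <= i < a1 and b0 <= j < b1
                             for i, j in touching)
            # Our modification: keep only pairs that cover a mismatch.
            has_mismatch = (any(i not in aligned_src for i in range(a0, a1))
                            or any(j not in aligned_tgt for j in range(b0, b1)))
            if consistent and has_mismatch:
                pairs.append((" ".join(src[a0:a1]), " ".join(tgt[b0:b1])))
    return pairs


for pair in mismatch_pairs("he changed his address recently",
                           "he changed his number recently"):
    print(pair)
```

Pairs such as ("he changed", "he changed") are consistent with the alignment but contain no mismatch, so they are dropped by the second condition.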

Translating what changed from S to S'

For this we'd use the clippings created in the steps above, i.e., the (s, s') pairs.

To take the context of the translation into account, we'd use double validation, i.e., we would consider only those pairs (s, t) which have the following properties: s is a sub-segment of S, s contains some mismatch between S and S', t is a sub-segment of T, and s translates to t. We'd call the covering set B.

We'd do this using the following algorithm:

In our example:

The set B would be:

[("changed his address", "canviar la seva adreça"),

("his address", "la seva adreça"),

("his address recently", "la seva adreça recentment"),

("address recently", "adreça recentment")]

We would also have an "-r" option that could be used to find more pairs by extracting sub-segments from T and finding their translations; these would be added to the pairs above if the translations are sub-segments of S as well.
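The double validation for set B could look roughly like the sketch below. It assumes that the s sides of set A are the candidates, that `translate` is any callable mapping a source string to its machine translation (a dictionary lookup is used here as a stand-in for Apertium), and that words are compared case-insensitively; `contains_segment` and `build_set_b` are illustrative names.

```python
def contains_segment(sentence_words, segment_words):
    """True if segment_words occurs as a contiguous run inside sentence_words."""
    n = len(segment_words)
    if n == 0:
        return False
    return any(sentence_words[i:i + n] == segment_words
               for i in range(len(sentence_words) - n + 1))


def build_set_b(set_a, target_sentence, translate):
    """Keep (s, t) only when t = MT(s) is also observed as a sub-segment of T."""
    target_words = target_sentence.lower().split()
    set_b, seen = [], set()
    for s, _ in set_a:
        if s in seen:  # each source sub-segment is translated only once
            continue
        seen.add(s)
        t = translate(s)
        if contains_segment(target_words, t.lower().split()):
            set_b.append((s, t))
    return set_b


# Stand-in translator for the example; a real run would call Apertium instead.
toy_mt = {
    "changed his": "canviar el seu",
    "changed his address": "canviar la seva adreça",
    "his address": "la seva adreça",
    "his address recently": "la seva adreça recentment",
    "address recently": "adreça recentment",
}
set_a = [("changed his", "changed his number"),
         ("his address", "his number"),
         ("address recently", "number recently"),
         ("changed his address", "changed his"),
         ("changed his address", "changed his number"),
         ("his address recently", "his number recently")]
print(build_set_b(set_a, "Va canviar la seva adreça recentment",
                  lambda s: toy_mt.get(s, "")))
```

Here "changed his" is rejected because its translation does not occur in T, which is the effect the double validation is meant to have.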

Translating changes in S'

We'd use the Apertium Python API (developed in the coding challenge) to obtain pairs (s', t'). These pairs would have the following properties: s' is a sub-segment of S', s' carries some variation (between S and S'), and s' translates to t'. We'd call the covering set C.

We'd use the following algorithm (figure: Algo2.png) to find C:

In our example:

The set C would be:

[("changed his number", "canviar el seu número"),

("changed his", "canviar el seu"),

("his number", "el seu número"),

("his number recently", "el seu número recentment"),

("number recently", "número recentment")]

As stated above, we can use the "-r" option to increase the chances of getting more pairs.
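Building set C could be sketched as follows. The wrapper `apertium_translate` is an assumption standing in for the Python API from the coding challenge: it simply pipes text through the `apertium` command-line tool, which is assumed to be installed and on the PATH together with the language pair. `build_set_c` is an illustrative name and takes the s' sides of set A as its candidates.

```python
import subprocess


def apertium_translate(text, pair="en-ca", directory=None):
    """Translate text by piping it through the apertium command-line tool."""
    command = ["apertium"]
    if directory:
        command += ["-d", directory]  # language-pair installation directory
    command.append(pair)
    done = subprocess.run(command, input=text, capture_output=True,
                          text=True, check=True)
    return done.stdout.strip()


def build_set_c(set_a, translate=apertium_translate):
    """Pair every distinct s' from set A with its machine translation t'."""
    set_c, seen = [], set()
    for _, s_new in set_a:
        if s_new in seen:
            continue
        seen.add(s_new)
        set_c.append((s_new, translate(s_new)))
    return set_c
```

Passing the translator in as an argument keeps the function testable without an Apertium installation, for example `build_set_c(set_a, translate=my_lookup)` with any callable `my_lookup`.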

Obtaining repair pairs

Using sets A, B and C, we'd find pairs (t, t'). As the number of such pairs could be large, we'd employ some post-processing technique to reduce their number (like removing subsets). These pairs would be our repair operations, using which we'd try to obtain T'.
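One possible way to combine the three sets is sketched below. It is an assumption about how the pieces fit together: a pair (s, s') from A yields the operation (t, t') when s has a validated translation t in B and s' has a translation t' in C. The pruning rule in `drop_subsumed` is only one reading of "removing subsets": an operation is dropped when a shorter operation already produces the same change. Both function names are illustrative.

```python
def repair_operations(set_a, set_b, set_c):
    """Derive (t, t') pairs by looking up both sides of each (s, s') in B and C."""
    b, c = dict(set_b), dict(set_c)
    ops = []
    for s, s_new in set_a:
        if s in b and s_new in c:
            op = (b[s], c[s_new])
            if op[0] != op[1] and op not in ops:
                ops.append(op)
    return ops


def drop_subsumed(ops):
    """Remove operations whose effect is already achieved by another operation."""
    kept = []
    for t, t_new in ops:
        subsumed = False
        for u, u_new in ops:
            if (u, u_new) == (t, t_new):
                continue
            # Padding with spaces keeps the replacement on word boundaries.
            if f" {t} ".replace(f" {u} ", f" {u_new} ") == f" {t_new} ":
                subsumed = True
                break
        if not subsumed:
            kept.append((t, t_new))
    return kept


set_a = [("his address", "his number"),
         ("changed his address", "changed his number")]
set_b = [("his address", "la seva adreça"),
         ("changed his address", "canviar la seva adreça")]
set_c = [("his number", "el seu número"),
         ("changed his number", "canviar el seu número")]
print(drop_subsumed(repair_operations(set_a, set_b, set_c)))
```

In this small run the longer operation is dropped because replacing "la seva adreça" with "el seu número" inside it gives the same result.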

Obtaining T'

As the set of repair operations (t, t') would usually contain more than one pair, we'd have to develop a repair policy. We could do this with machine learning, by learning from an example set: we would calculate the values of FMS(T, T'), where T is a repaired sentence from its covering set T* and T' is the given translation.
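The scoring step that such a policy would learn from could be sketched like this. It is a simplified illustration: each operation is applied on its own, the reference translation is assumed to be available (as it would be in a training set), and the names `apply_repair` and `rank_candidates` are hypothetical. No learning is done here; the sketch only produces the FMS values that would serve as training signal.

```python
def edit_distance(a, b):
    """Word-level Levenshtein distance between two lists of tokens."""
    prev = list(range(len(b) + 1))
    for i, word_a in enumerate(a, start=1):
        curr = [i]
        for j, word_b in enumerate(b, start=1):
            cost = 0 if word_a == word_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def fms(x, y):
    a, b = x.split(), y.split()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - edit_distance(a, b) / longest


def apply_repair(target, t, t_new):
    """Replace the first occurrence of sub-segment t in target by t_new."""
    words = target.split()
    lowered = [w.lower() for w in words]  # match case-insensitively
    old = t.lower().split()
    n = len(old)
    if n == 0:
        return None
    for i in range(len(words) - n + 1):
        if lowered[i:i + n] == old:
            return " ".join(words[:i] + t_new.split() + words[i + n:])
    return None  # t does not occur in target


def rank_candidates(target, ops, reference):
    """Score every repaired sentence against the reference, best first."""
    scored = []
    for t, t_new in ops:
        repaired = apply_repair(target, t, t_new)
        if repaired is not None:
            scored.append((fms(repaired, reference), repaired, (t, t_new)))
    return sorted(scored, key=lambda item: item[0], reverse=True)


ops = [("adreça recentment", "número recentment"),
       ("la seva adreça", "el seu número")]
for score, sentence, op in rank_candidates(
        "Va canviar la seva adreça recentment", ops,
        "Va canviar el seu número recentment"):  # reference assumed for illustration
    print(round(score, 2), sentence)
```

A real policy would also have to decide whether several operations can be combined; this sketch leaves that open.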

Preprocessing

Once the framework is ready, we could preprocess an existing translation memory, using the coding challenge work, to get and index a large set of (s, t) pairs that are "doubly validated": on the one hand, t is the MT of s (or s is the MT of t), and on the other hand, they have been observed in the translation memory. In the future, they could be used as "higher quality" (s', t') pairs to build "better" patches for new sentences.

API Call

As the project allows the use of any scripting language, I'll be using Python.

The main program would have the following API:

repair.py S S' T LP [--min-fms (default 80)] [--min-len (default 3)] [--max-len (default 3)] [-r] [-s] [-h] [-d Directory]

positional arguments:

 S           Source Language Sentence

 S'          Second Source Language Sentence

 T           Target Language Sentence

 LP          Language Pair (for example 'en-eo')

optional arguments:

 -h, --help  show this help message and exit

 -d D        Specify the language-pair installation directory

 -r          Check for pairs in the reverse direction (target to source) as well

 -s          Ignore single words

 --min-fms   Minimum Fuzzy match score required to process (default value: 80%)

 --min-len   Minimum length of the sub-segments (default value: 3)

 --max-len   Maximum length of the sub-segments  (default value: 3)
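The interface above maps naturally onto Python's `argparse` module. The following is a sketch of how the parser might be declared, taking the positional order from the usage line; the function name `build_parser` and the attribute names (`S_prime`, `directory`, `reverse`, `skip_single`) are illustrative choices, not part of the proposal.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(
        prog="repair.py",
        description="Fuzzy-match repair from a translation memory.")
    # Positional arguments, in the order of the usage line.
    parser.add_argument("S", help="source language sentence")
    parser.add_argument("S_prime", metavar="S'",
                        help="second source language sentence")
    parser.add_argument("T", help="target language sentence")
    parser.add_argument("LP", help="language pair (for example 'en-eo')")
    # Optional arguments; -h/--help is added by argparse automatically.
    parser.add_argument("-d", metavar="D", dest="directory",
                        help="language-pair installation directory")
    parser.add_argument("-r", dest="reverse", action="store_true",
                        help="check for pairs in the reverse direction as well")
    parser.add_argument("-s", dest="skip_single", action="store_true",
                        help="ignore single words")
    parser.add_argument("--min-fms", type=float, default=80.0,
                        help="minimum fuzzy match score in percent (default: 80)")
    parser.add_argument("--min-len", type=int, default=3,
                        help="minimum length of the sub-segments (default: 3)")
    parser.add_argument("--max-len", type=int, default=3,
                        help="maximum length of the sub-segments (default: 3)")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

A call such as `repair.py "he changed his address recently" "he changed his number recently" "Va canviar la seva adreça recentment" en-ca --min-len 2` would then expose the values as `args.S`, `args.S_prime`, `args.T`, `args.LP` and `args.min_len`.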

Timeline of the Project

We'd use the following schedule for executing this project:

Community Interaction Period: I would use this interval to interact with the Apertium community and project mentors. Apart from this, I'd read the existing work and the required algorithms.

What's been done till now?

FMS calculator
Source-Target sub-segments generator
Phrase extraction Algorithm [basic, needs changes]

Remaining Plan:

Week 1: Improving Phrase extraction Algorithm
Week 2: Developing Set B and C generator
Week 3: Developing Repair operations generator
Week 4: Testing and Code clean up
Deliverable #1: Repair Operations generator
Week 5-6: Learning from examples to develop a heuristic-based repairing algorithm
Week 7: Testing above algorithm
Week 8: Testing and Code clean up

Deliverable #2: Fuzzy match repairer

Week 9: Preprocessing (how to store, etc.)
Week 10: Testing with some existing Translation Memory
Week 11: Working on (improving) things that couldn't be completed on time
Week 12: Code clean up and Documentation (most of the documentation would be written during the coding phase)

Project completed

My skills and qualifications

List your skills and give evidence of your qualifications. Tell us what is your current field of study, major, etc. Convince us that you can do the work. In particular we would like to know whether you have programmed before in open-source projects.

List any non-Summer-of-Code plans

No, I don't have any other engagements for the summer and would be more than happy to devote 30+ hours every week to this project.
