Difference between revisions of "User:Pankajksharma/Patcher"

Revision as of 17:38, 20 August 2014

Wiki page for the work done by pankajksharma during GSoC 2014.

A screenshot of the original algorithm is provided here, along with instructions on how to use it.

Application http://wiki.apertium.org/wiki/User:Pankajksharma/Application

For installation instructions, please visit: http://wiki.apertium.org/wiki/User:Pankajksharma/Patcher_Installation

You can clone the project (using git) from https://github.com/pankajksharma/py-apertium

Apertium Python Patcher

Consider a pair of sentences (S, S') in one natural language (say s) whose fuzzy match score is above a certain threshold (simply put, they are similar to each other). If T is the translation of S into another natural language (say t), the job of this patcher is to obtain T', i.e., the corresponding translation of S'.

See http://wiki.apertium.org/wiki/User:Pankajksharma/Application#Proposal for more detail.

Algorithm

This algorithm for on-the-fly patching was developed by Mikel L. Forcada and Pankaj K. Sharma.

Heuristics for best match

Currently, the total length of the phrases used to obtain a patch is used to guess the best possible patch. This sum is expected to represent the degree of coverage that each candidate patch achieves.
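As a rough illustration of this heuristic, the sketch below scores each candidate patch by the summed length of the phrases it used and keeps the highest-scoring one. The data layout (a dict with the hypothetical keys `text` and `phrases`), the function names, and the choice to count length in words are assumptions made for illustration; they are not taken from py-apertium, and the candidates are placeholder data.

```python
def coverage_score(phrases):
    """Sum of phrase lengths, counted in words (an assumption)."""
    return sum(len(phrase.split()) for phrase in phrases)


def best_patch(candidates):
    """Return the candidate whose phrases cover the most words."""
    return max(candidates, key=lambda c: coverage_score(c["phrases"]))


# Placeholder candidates: each records the phrases used to build it.
candidates = [
    {"text": "candidate A", "phrases": ["black cat"]},
    {"text": "candidate B", "phrases": ["the black cat", "cat was"]},
]

print(best_patch(candidates)["text"])  # candidate B (score 5 versus 2)
```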

The grounding tag

This patcher provides a --go (grounded only) option. This option only patches when the mismatch is covered from both sides. For example, ('he went there', 'he wanted to go there') is a grounded phrase pair but ('he went there', 'she went there') is not.

Grounding (--go) requires min-len (the minimum length of a phrase) to be more than 1. Otherwise, no patching would be possible.
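One way to read "covered from both sides" is that the two phrases share at least one word before the mismatch and at least one word after it. The sketch below checks exactly that. The function name `is_grounded` and this interpretation are assumptions and are not code from py-apertium.

```python
def is_grounded(phrase, phrase1):
    """True if the phrases differ but share a common first and last part."""
    a, b = phrase.split(), phrase1.split()
    if a == b:
        return False  # no mismatch at all
    shortest = min(len(a), len(b))
    # Length of the common prefix, in words.
    prefix = 0
    while prefix < shortest and a[prefix] == b[prefix]:
        prefix += 1
    # Length of the common suffix that does not overlap the prefix.
    suffix = 0
    while suffix < shortest - prefix and a[-1 - suffix] == b[-1 - suffix]:
        suffix += 1
    return prefix >= 1 and suffix >= 1


print(is_grounded("he went there", "he wanted to go there"))  # True
print(is_grounded("he went there", "she went there"))         # False
```

Under this reading a one-word phrase can never be grounded, because it cannot have both a left and a right context, which is consistent with the min-len requirement.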

Scripts and how to run

The project consists of multiple scripts. This section gives a brief introduction to each one, shows how to run it, and explains all of its options. Please contact the author if you face any difficulty.

On the fly patching (repair.py)

         usage: repair.py [-h] [-v] [-t] [-c C] [-d D] [--cam] [--go]
                [--min-fms MIN_FMS] [--min-len MIN_LEN] [--max-len MAX_LEN]
                S T S1 LP

This is the main script, which performs on-the-fly patching for a given (S, T, S1, LP).

positional arguments:

 S                  First Sentence

 T                  First Sentence Translation

 S1                 Second Sentence

 LP                 Language Pair

optional arguments:

 -h, --help         show this help message and exit

 -v                 Verbose Mode

 -t                 Show patching traces

 -c C               Specify the sqlite3 db to be used for caching

 -d D               Specify the language-pair installation directory

 --cam              Only those patches which cover all the mismatches

 --go               To patch only grounded mismatches

 --min-fms MIN_FMS  Minimum value of fuzzy match score of S and S1.

 --min-len MIN_LEN  Minimum length of sub-segment allowed.

 --max-len MAX_LEN  Maximum length of sub-segment allowed.

example: python repair.py "the black dog was barking whole night" "el perro negro ladraba noche entera" "the black cat was barking whole night" en-es -v
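The command line above can be mirrored with argparse. The sketch below is reconstructed from the usage text only; the option types (float for --min-fms, int for the two lengths) and the absence of default values are assumptions, and the parser in the real repair.py may differ.

```python
import argparse


def build_parser():
    """Parser mirroring the documented repair.py usage (a reconstruction)."""
    parser = argparse.ArgumentParser(prog="repair.py")
    # Positional arguments, in the documented order.
    for name in ("S", "T", "S1", "LP"):
        parser.add_argument(name)
    # Boolean switches.
    parser.add_argument("-v", action="store_true", help="Verbose Mode")
    parser.add_argument("-t", action="store_true", help="Show patching traces")
    parser.add_argument("--cam", action="store_true")
    parser.add_argument("--go", action="store_true")
    # Options that take a value; the types are assumed, not documented.
    parser.add_argument("-c", metavar="C")
    parser.add_argument("-d", metavar="D")
    parser.add_argument("--min-fms", type=float)
    parser.add_argument("--min-len", type=int)
    parser.add_argument("--max-len", type=int)
    return parser


args = build_parser().parse_args([
    "the black dog was barking whole night",
    "el perro negro ladraba noche entera",
    "the black cat was barking whole night",
    "en-es",
    "-v",
])
print(args.LP, args.v, args.go)  # en-es True False
```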

tmx_patcher.py

Reads a Translation Memory (TM) and tries to patch the given sentence (S) with the help of the best-matching sentence available in the TM.

       usage: tmx_patcher.py [-h] [-v] [-t] [-c C] [-d D] [--cam] [--go]
                     [--min-fms MIN_FMS] [--min-len MIN_LEN]
                     [--max-len MAX_LEN]
                     TM S LP

positional arguments:

 TM                 Translation Memory

 S                  Second Sentence

 LP                 Language Pair for TM (for example en-eo)

optional arguments:

 -h, --help         show this help message and exit

 -v                 Verbose Mode

 -t                 Show patching traces

 -c C               Specify the sqlite3 db to be used for caching

 -d D               Specify the language-pair installation directory

 --cam              Only those patches which cover all the mismatches

 --go               To patch only grounded mismatches

 --min-fms MIN_FMS  Minimum value of fuzzy match score of S and S1.

 --min-len MIN_LEN  Minimum length of sub-segment allowed.

 --max-len MAX_LEN  Maximum length of sub-segment allowed.

Example: python tmx_patcher.py /tmp/ca-en-short.tmx "Les úniques respostes útils són les que creen preguntes." ca-en --min-fms 0.8 --cam --min-len 1 --max-len 3 -v
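The lookup step described for this script (find the TM entry closest to S) could look roughly like the sketch below. It is not the code of tmx_patcher.py: the helper names are hypothetical, the word-level score 1 - ED / max(length) is an assumed definition of FMS, and the reader assumes TMX files that mark languages with the `xml:lang` attribute. The inline TMX sample is illustrative data.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def word_fms(s, s1):
    """Assumed FMS: 1 - word-level edit distance / length of longer sentence."""
    a, b = s.split(), s1.split()
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        cur = [i]
        for j, wb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (wa != wb)))
        prev = cur
    return 1.0 - prev[-1] / max(len(a), len(b))


def read_tmx(tmx_text, src, tgt):
    """Return (source, target) segment pairs for the given language codes."""
    units = []
    for tu in ET.fromstring(tmx_text).iter("tu"):
        segs = {tuv.get(XML_LANG): tuv.findtext("seg") for tuv in tu.iter("tuv")}
        if src in segs and tgt in segs:
            units.append((segs[src], segs[tgt]))
    return units


def best_match(units, s, min_fms):
    """Return (score, source, target) of the closest unit, or None."""
    scored = [(word_fms(src, s), src, tgt) for src, tgt in units]
    scored = [item for item in scored if item[0] >= min_fms]
    return max(scored, default=None)


SAMPLE = """<tmx version="1.4"><header/><body>
<tu><tuv xml:lang="en"><seg>he changed his number recently</seg></tuv>
<tuv xml:lang="ca"><seg>Va canviar el seu número recentment</seg></tuv></tu>
<tu><tuv xml:lang="en"><seg>good morning</seg></tuv>
<tuv xml:lang="ca"><seg>bon dia</seg></tuv></tu>
</body></tmx>"""

units = read_tmx(SAMPLE, "en", "ca")
match = best_match(units, "he changed his address recently", min_fms=0.7)
print(match[2])  # Va canviar el seu número recentment
```

The matched pair (source, target) together with S would then be handed to the patching step.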

fms.py

       usage: fms.py [-h] S S1

Provides FMS of strings S and S1 using Wagner-Fischer algorithm.

positional arguments:

 S           First Sentence

 S1          Second Sentence

optional arguments:

 -h, --help  show this help message and exit
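For reference, a minimal Wagner-Fischer sketch is given below. It assumes that sentences are compared word by word and that FMS is defined as 1 - ED / max(length); neither detail is stated on this page, so treat both as assumptions rather than as the exact behaviour of fms.py.

```python
def edit_distance(a, b):
    """Wagner-Fischer dynamic programme over two token lists."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
    return d[-1][-1]


def fms(s, s1):
    """Assumed fuzzy match score in [0, 1]; 1.0 means identical."""
    a, b = s.split(), s1.split()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


print(round(fms("some string", "some another string"), 3))  # 0.667
```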

reg_test.py

Regression test for our patcher (repair.py). This script takes out (the output file of preprocess.py) and LP (the name of the language pair).

       usage: reg_test.py [-h] [-d D] [-c C] [-v] [--mode MODE] [--go]
                  [--min-fms MIN_FMS] [--min-len MIN_LEN] [--max-len MAX_LEN]
                  out LP

positional arguments:

 out                Output file generated from preprocess.py

 LP                 Language Pair (sl-tl)

optional arguments:

 -h, --help         show this help message and exit

 -d D               Specify the language-pair installation directory

 -c C               Specify the sqlite3 db to be used for caching

 -v                 Verbose Mode

 --mode MODE        Modes('all', 'cam', 'compare')

 --go               To patch only grounded mismatches

 --min-fms MIN_FMS  Minimum value of fuzzy match score of S and S1.

 --min-len MIN_LEN  Minimum length of sub-string allowed.

 --max-len MAX_LEN  Maximum length of sub-string allowed.

The script understands the following values for --mode:

all Includes all types of patched sentences

cam Includes only those sentences whose patches cover all mismatches

compare Compares the results of the above two modes (verbose output does not work in this mode)

Example: python reg_test.py pairs/en-es.pairs en-es --mode compare

preprocess.py

Preprocesses the corpus to generate the input for reg_test.py.

       usage: preprocess.py [-h] [-v] [--min-fms MIN_FMS] [--max-len MAX_LEN]
                    SLF TLF SLFT TLFT OUT

positional arguments:

 SLF                Source Language file for training

 TLF                Target Language file for training

 SLFT               Source Language file for testing

 TLFT               Target Language file for testing

 OUT                Output file for saving pairs

optional arguments:

 -h, --help         show this help message and exit

 --min-fms MIN_FMS  Minimum value of fuzzy match score of S and S1 (default 0.8)

 --max-len MAX_LEN  Maximum length of sentences allowed (default 25)

 example: python preprocess.py en.en-es.train es.en-es.train en.en-es.testset es.en-es.test en-es.pairs

file_stats.py

Calculates and shows a histogram of the distribution of FMS between pairs of sentences present in corpus F.

       usage: file_stats.py [-h] [--min-fms MIN_FMS] F

positional arguments:

 F                  Corpus path.

optional arguments:

 -h, --help         show this help message and exit

 --min-fms MIN_FMS  Minimum value of fuzzy match score of S and S1.
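The binning behind such a histogram can be sketched as follows. This is only an illustration: it takes scores that have already been computed, uses ten equal-width bins, and renders plain text, all of which are assumptions rather than a description of what file_stats.py does. The function names are hypothetical.

```python
from collections import Counter


def fms_histogram(scores, bins=10, min_fms=0.0):
    """Count scores per equal-width bin over [0, 1], dropping low scores."""
    kept = [score for score in scores if score >= min_fms]
    # A score of exactly 1.0 is placed in the last bin.
    counts = Counter(min(int(score * bins), bins - 1) for score in kept)
    return [counts.get(i, 0) for i in range(bins)]


def render(hist):
    """Render one text line per bin, with one '#' per counted score."""
    bins = len(hist)
    lines = []
    for i, n in enumerate(hist):
        lines.append("%.1f-%.1f | %s" % (i / bins, (i + 1) / bins, "#" * n))
    return "\n".join(lines)


hist = fms_histogram([0.05, 0.55, 0.85, 0.95, 1.0])
print(hist)  # [1, 0, 0, 0, 0, 1, 0, 0, 1, 2]
print(render(hist))
```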

stats.py

       usage: stats.py [-h] [-d D] [--min-fms MIN_FMS] [--min-len MIN_LEN]
               [--max-len MAX_LEN]
               D

Calculates the FMS distribution for all corpora present in directory D.

positional arguments:

 D                  Corpus directory.

optional arguments:

 -h, --help         show this help message and exit

 -d D               Specify the language-pair installation directory

 --min-fms MIN_FMS  Minimum value of fuzzy match score of S and S1.

 --min-len MIN_LEN  Minimum length of sub-string allowed.

 --max-len MAX_LEN  Maximum length of sub-string allowed.

Set A generator (deprecated)

       usage: A_generator.py [-h] [--min-fms MIN_FMS] [--min-len MIN_LEN]
                     [--max-len MAX_LEN]
                     S S1

Generates set A.

positional arguments:

 S                  First Sentence

 S1                 Second Sentence

optional arguments:

 -h, --help         show this help message and exit

 --min-fms MIN_FMS  Minimum value of fuzzy match score of S and S1.

 --min-len MIN_LEN  Minimum length of sub-string allowed.

 --max-len MAX_LEN  Maximum length of sub-string allowed.

Example: python A_generator.py "some string" "some another string" --min-fms=0.6 --min-len=1 --max-len=3

Expected Output:

("some string", "some another string")

("string", "another string")

Set D generator (deprecated)

       usage: D_generator.py [-h] [-d D] [--min-fms MIN_FMS] [--min-len MIN_LEN] [--max-len MAX_LEN]
       S T S1 LP

Generates set D.

positional arguments:

 S                  First Sentence

 T                  First Sentence Translation

 S1                 Second Sentence

 LP                 Language Pair

optional arguments:

 -h, --help         show this help message and exit

 -d D               Specify the language-pair installation directory

 --min-fms MIN_FMS  Minimum value of fuzzy match score of S and S1.

 --min-len MIN_LEN  Minimum length of sub-string allowed.

 --max-len MAX_LEN  Maximum length of sub-string allowed.

Example: python D_generator.py "he changed his number recently" "Va canviar el seu número recentment" "he changed his address recently" en-ca

("Va canviar el seu", "Va canviar la seva adreça")

("Va canviar el seu número", "Va canviar el seu")

("Va canviar el seu número", "Va canviar la seva adreça")

("Va canviar el seu número recentment", "Va canviar la seva adreça recentment")

("El seu número", "El seu")

("El seu número", "La seva adreça")

("El seu número recentment", "La seva adreça recentment")

("Número recentment", "Recentment")

("Número recentment", "Adreça recentment")

pre_gsoc

For pre-GSoC work, see [pre-soc/](https://github.com/pankajksharma/py-apertium/tree/master/pre_soc)
