User:Pankajksharma/Patcher
Latest revision as of 08:30, 2 June 2016
Wiki page for GSoC work done by pankajksharma during GSoC 2014.
This page provides a screenshot of the original algorithm and explains how to use the scripts.
See the Application (http://wiki.apertium.org/wiki/User:Pankajksharma/Application) and the original idea page: Ideas_for_Google_Summer_of_Code/Command-line_translation_memory_fuzzy-match_repair
For installation instructions, please visit: http://wiki.apertium.org/wiki/User:Pankajksharma/Patcher_Installation
You can clone the project (using git) from https://github.com/pankajksharma/py-apertium
Apertium Python Patcher
Suppose a pair of sentences (S, S') in one natural language (say s) has a fuzzy match score above a certain threshold (simply put, the two sentences are similar to each other). If T is the translation of S into another natural language (say t), the job of this patcher is to obtain T', i.e., the corresponding translation of S'.
See http://wiki.apertium.org/wiki/User:Pankajksharma/Application#Proposal for more detail.
Algorithm
This algorithm for on-the-fly patching was developed by Mikel L. Forcada and Pankaj K. Sharma.
Heuristics for best match
Currently, the total length of the phrases used to obtain a patch is used to guess the best possible patch. This sum is expected to represent the degree of coverage that each candidate patch achieves.
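The heuristic above can be sketched as follows. This is only an illustration, not the project's code: the names coverage_score and best_patch, the dictionary layout, and the choice to measure phrase length in words are all assumptions, and the candidate labels are placeholders.

```python
def coverage_score(phrases):
    """Total length, in words, of the phrases used to build one patch.

    Measuring length in words is an assumption made for this sketch.
    """
    return sum(len(phrase.split()) for phrase in phrases)


def best_patch(candidates):
    """Return the candidate whose phrases have the largest total length.

    `candidates` is a hypothetical structure: it maps each candidate
    patch to the list of source phrases that were used to obtain it.
    """
    return max(candidates, key=lambda patch: coverage_score(candidates[patch]))


# Placeholder candidates, for illustration only.
candidates = {
    "candidate A": ["the black cat", "cat was"],  # 3 + 2 = 5 words
    "candidate B": ["cat"],                       # 1 word
}
print(best_patch(candidates))
```

With these placeholder inputs, candidate A is preferred because its phrases cover more words in total.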
The grounding tag
This patcher provides a --go (grounded only) option. This option patches only when the mismatch is covered from both sides. For example, ('he went there', 'he wanted to go there') is a grounded phrase pair but ('he went there', 'she went there') is not.
Grounding (--go) requires --min-len (the minimum phrase length) to be greater than 1; otherwise, no patching would be possible.
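One possible reading of "covered from both sides" is that the two phrases share at least one word before the mismatch and at least one word after it. The sketch below implements that reading only; the function names are hypothetical and the actual check in repair.py may differ.

```python
def common_prefix_len(a, b):
    """Number of leading items that two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def is_grounded(phrase, phrase1):
    """True if the phrases share a word on both sides of the mismatch.

    This is an assumed interpretation of grounding, not the project's
    definition.
    """
    a, b = phrase.split(), phrase1.split()
    prefix = common_prefix_len(a, b)
    # Words already counted in the prefix must not be reused as suffix.
    limit = min(len(a), len(b)) - prefix
    suffix = common_prefix_len(a[::-1][:limit], b[::-1][:limit])
    return prefix > 0 and suffix > 0


print(is_grounded("he went there", "he wanted to go there"))  # True
print(is_grounded("he went there", "she went there"))         # False
```

Under this reading a one-word phrase can never be grounded, which is consistent with the requirement that the minimum length be greater than 1.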
Scripts and how to run
The project consists of multiple scripts. For each script, this section gives a brief introduction, its usage, and an explanation of the available options. Please contact the author if you face any difficulty.
On the fly patching (repair.py)
usage: repair.py [-h] [-v] [-t] [-c C] [-d D] [--cam] [--go] [--min-fms MIN_FMS] [--min-len MIN_LEN] [--max-len MAX_LEN] S T S1 LP
This is the main script, which does on-the-fly patching for a given (S, T, S1, LP).
positional arguments:
S First Sentence
T First Sentence Translation
S1 Second Sentence
LP Language Pair
optional arguments:
-h, --help show this help message and exit
-v Verbose Mode
-t Show patching traces
-c C.db Specify the sqlite3 db to be used for caching
-d D Specify the language-pair installation directory
--cam Only those patches which cover all the mismatches
--go To patch only grounded mismatches
--min-fms MIN_FMS Minimum value of fuzzy match score of S and S1.
--min-len MIN_LEN Minimum length of sub-segment allowed.
--max-len MAX_LEN Maximum length of sub-segment allowed.
example: python repair.py "the black dog was barking whole night" "el perro negro ladraba noche entera" "the black cat was barking whole night" en-es -v
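The --min-len and --max-len options bound the length of the sub-segments considered. Assuming a sub-segment is a contiguous sequence of whole words, the enumeration could look like the sketch below. The function name and the default values are hypothetical and are not taken from repair.py.

```python
def subsegments(sentence, min_len=1, max_len=3):
    """List every contiguous run of min_len..max_len whole words.

    The defaults are placeholders chosen for this sketch.
    """
    words = sentence.split()
    result = []
    for start in range(len(words)):
        for length in range(min_len, max_len + 1):
            end = start + length
            if end > len(words):
                break  # longer runs from this start would overflow too
            result.append(" ".join(words[start:end]))
    return result


print(subsegments("the black cat", min_len=2, max_len=3))
# ['the black', 'the black cat', 'black cat']
```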
tmx_patcher.py
Reads a translation memory (TM) and tries to patch the given sentence (S) with the help of the best-matching sentence available in the TM.
usage: tmx_patcher.py [-h] [-v] [-t] [-c C] [-d D] [--cam] [--go] [--min-fms MIN_FMS] [--min-len MIN_LEN] [--max-len MAX_LEN] TM S LP
positional arguments:
TM Translation Memory
S Second Sentence
LP Language Pair for TM (for example en-eo)
optional arguments:
-h, --help show this help message and exit
-v Verbose Mode
-t Show patching traces
-c C Specify the sqlite3 db to be used for caching
-d D Specify the language-pair installation directory
--cam Only those patches which cover all the mismatches
--go To patch only grounded mismatches
--min-fms MIN_FMS Minimum value of fuzzy match score of S and S1.
--min-len MIN_LEN Minimum length of sub-segment allowed.
--max-len MAX_LEN Maximum length of sub-segment allowed.
Example: python tmx_patcher.py /tmp/ca-en-short.tmx "Les úniques respostes útils són les que creen preguntes." ca-en --min-fms 0.8 --cam --min-len 1 --max-len 3 -v
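The lookup step, finding the best-matching TM entry for S, might be sketched as below. This is not the code of tmx_patcher.py: the helper names are hypothetical, the TMX content is an inline toy document, and difflib's word-level similarity ratio stands in for the project's fuzzy match score.

```python
import difflib
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tm(tmx_text, src_lang, tgt_lang):
    """Return (source, target) segment pairs from a TMX document string."""
    pairs = []
    for tu in ET.fromstring(tmx_text).iter("tu"):
        segs = {}
        for tuv in tu.iter("tuv"):
            seg = tuv.find("seg")
            if seg is not None:
                segs[tuv.get(XML_LANG)] = "".join(seg.itertext()).strip()
        if src_lang in segs and tgt_lang in segs:
            pairs.append((segs[src_lang], segs[tgt_lang]))
    return pairs


def best_match(pairs, sentence):
    """Pick the pair whose source side is most similar to `sentence`.

    difflib's ratio over word lists is a stand-in similarity measure.
    """
    def similarity(pair):
        matcher = difflib.SequenceMatcher(None, pair[0].split(), sentence.split())
        return matcher.ratio()
    return max(pairs, key=similarity)


# Toy translation memory, for illustration only.
TMX = """<tmx version="1.4">
  <header/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>the black dog was barking whole night</seg></tuv>
      <tuv xml:lang="es"><seg>el perro negro ladraba noche entera</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>good morning</seg></tuv>
      <tuv xml:lang="es"><seg>buenos días</seg></tuv>
    </tu>
  </body>
</tmx>"""

pairs = read_tm(TMX, "en", "es")
print(best_match(pairs, "the black cat was barking whole night"))
```

The selected pair (S, T) together with the new sentence would then be handed to the patching step, as repair.py does for a single pair.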
fms.py
usage: fms.py [-h] S S1
Provides the FMS (fuzzy match score) of strings S and S1 using the Wagner-Fischer algorithm.
positional arguments:
S First Sentence
S1 Second Sentence
optional arguments:
-h, --help show this help message and exit
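As a rough illustration of the computation, the sketch below implements the Wagner-Fischer edit distance and derives a score from it. Two points are assumptions, since this page does not spell them out: the distance is taken over words rather than characters, and the score is normalised as 1 - distance / max(length). The code is written for Python 3, and fms.py may differ on any of these points.

```python
def edit_distance(a, b):
    """Wagner-Fischer edit distance between two sequences."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def fms(s, s1):
    """Word-level fuzzy match score in [0, 1] (assumed normalisation)."""
    a, b = s.split(), s1.split()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty sentences are treated as identical
    return 1.0 - edit_distance(a, b) / longest


print(fms("the black dog was barking whole night",
          "the black cat was barking whole night"))
```

Here one word out of seven differs, so under the assumed normalisation the score is 1 - 1/7, which is above the 0.8 threshold used in the examples on this page.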
reg_test.py
Regression test for our patcher (repair.py). This script takes out (the output of preprocess.py) and LP (the name of the language pair).
usage: reg_test.py [-h] [-d D] [-c C] [-v] [--mode MODE] [--go] [--min-fms MIN_FMS] [--min-len MIN_LEN] [--max-len MAX_LEN] out LP
positional arguments:
out Output file generated from preprocess.py
LP Language Pair (sl-tl)
optional arguments:
-h, --help show this help message and exit
-d D Specify the language-pair installation directory
-c C Specify the sqlite3 db to be used for caching
-v Verbose Mode
--mode MODE Modes('all', 'cam', 'compare')
--go To patch only grounded mismatches
--min-fms MIN_FMS Minimum value of fuzzy match score of S and S1.
--min-len MIN_LEN Minimum length of sub-string allowed.
--max-len MAX_LEN Maximum length of sub-string allowed.
The script understands the following modes (values of --mode):
all Includes all types of patched sentences
cam Includes only those sentences that cover all mismatches
compare Compares the results of the above two modes (verbose doesn't work in this mode)
example: python reg_test.py pairs/en-es.pairs en-es --mode compare
preprocess.py
Preprocesses the corpus to generate the input for reg_test.py.
usage: preprocess.py [-h] [-v] [--min-fms MIN_FMS] [--max-len MAX_LEN] SLF TLF SLFT TLFT OUT
positional arguments:
SLF Source Language file for training
TLF Target Language file for training
SLFT Source Language file for testing
TLFT Target Language file for testing
OUT Output file for saving pairs
optional arguments:
-h, --help show this help message and exit
--min-fms MIN_FMS Minimum value of fuzzy match score of S and S1 (default 0.8)
--max-len MAX_LEN Maximum length of sentences allowed (default 25)
example: python preprocess.py en.en-es.train es.en-es.train en.en-es.testset es.en-es.test en-es.pairs
file_stats.py
Calculates and shows a histogram of the distribution of FMS between pairs of sentences present in corpus F.
usage: file_stats.py [-h] [--min-fms MIN_FMS] F
positional arguments:
F Corpus path.
optional arguments:
-h, --help show this help message and exit
--min-fms MIN_FMS Minimum value of fuzzy match score of S and S1.
stats.py
usage: stats.py [-h] [-d D] [--min-fms MIN_FMS] [--min-len MIN_LEN] [--max-len MAX_LEN] D
Calculates the FMS distribution for all corpora present in directory D.
positional arguments:
D Corpus directory.
optional arguments:
-h, --help show this help message and exit
-d D Specify the language-pair installation directory
--min-fms MIN_FMS Minimum value of fuzzy match score of S and S1.
--min-len MIN_LEN Minimum length of sub-string allowed.
--max-len MAX_LEN Maximum length of sub-string allowed.
Set A generator (deprecated)
usage: A_generator.py [-h] [--min-fms MIN_FMS] [--min-len MIN_LEN] [--max-len MAX_LEN] S S1
Generates set A.
positional arguments:
S First Sentence
S1 Second Sentence
optional arguments:
-h, --help show this help message and exit
--min-fms MIN_FMS Minimum value of fuzzy match score of S and S1.
--min-len MIN_LEN Minimum length of sub-string allowed.
--max-len MAX_LEN Maximum length of sub-string allowed.
Example: python A_generator.py "some string" "some another string" --min-fms=0.6 --min-len=1 --max-len=3
Expected Output:
("some string", "some another string")
("string", "another string")
Set D generator (deprecated)
usage: D_generator.py [-h] [-d D] [--min-fms MIN_FMS] [--min-len MIN_LEN] [--max-len MAX_LEN] S T S1 LP
Generates set D.
positional arguments:
S First Sentence
T First Sentence Translation
S1 Second Sentence
LP Language Pair
optional arguments:
-h, --help show this help message and exit
-d D Specify the language-pair installation directory
--min-fms MIN_FMS Minimum value of fuzzy match score of S and S1.
--min-len MIN_LEN Minimum length of sub-string allowed.
--max-len MAX_LEN Maximum length of sub-string allowed.
Example: python D_generator.py "he changed his number recently" "Va canviar el seu número recentment" "he changed his address recently" en-ca
Expected Output:
("Va canviar el seu", "Va canviar la seva adreça")
("Va canviar el seu número", "Va canviar el seu")
("Va canviar el seu número", "Va canviar la seva adreça")
("Va canviar el seu número recentment", "Va canviar la seva adreça recentment")
("El seu número", "El seu")
("El seu número", "La seva adreça")
("El seu número recentment", "La seva adreça recentment")
("Número recentment", "Recentment")
("Número recentment", "Adreça recentment")
pre_gsoc
For pre-GSoC work, see pre_soc/: https://github.com/pankajksharma/py-apertium/tree/master/pre_soc