Difference between revisions of "User:Gang Chen/GSoC 2013 Progress"

Removed in this revision, replaced by "Detailed progress":

=== Detailed svn log ===

------------------------------------------------------------------------
r45280 | elephantgcc | 2013-06-24 14:21:20 +0800 (Mon, 24 Jun 2013) | 1 line

MOD: bugfix, normalize not 0.
------------------------------------------------------------------------
r45277 | elephantgcc | 2013-06-24 10:47:03 +0800 (Mon, 24 Jun 2013) | 1 line

MOD: add find_similar_ambiguity_class, fix bug in morpho_stream
------------------------------------------------------------------------
r45258 | elephantgcc | 2013-06-23 18:33:52 +0800 (Sun, 23 Jun 2013) | 1 line

MOD: add rule support for light-sw tagger, partly.
------------------------------------------------------------------------
r45235 | elephantgcc | 2013-06-22 13:53:59 +0800 (Sat, 22 Jun 2013) | 1 line

MOD: light sliding-window tagger, a basic working version.
------------------------------------------------------------------------
r45214 | elephantgcc | 2013-06-21 17:49:14 +0800 (Fri, 21 Jun 2013) | 1 line

ADD: add light-sw tagger, partly.
------------------------------------------------------------------------
r45213 | elephantgcc | 2013-06-21 16:19:24 +0800 (Fri, 21 Jun 2013) | 1 line

MOD: bugfix, ZERO define, init and iteration formula.
------------------------------------------------------------------------
r45209 | elephantgcc | 2013-06-21 13:01:50 +0800 (Fri, 21 Jun 2013) | 1 line

MOD: refine function names.
------------------------------------------------------------------------
r45208 | elephantgcc | 2013-06-21 11:14:45 +0800 (Fri, 21 Jun 2013) | 1 line

MOD: tagger_data write only non-ZERO values.
------------------------------------------------------------------------
r45194 | elephantgcc | 2013-06-20 19:32:51 +0800 (Thu, 20 Jun 2013) | 1 line

MOD: style check.
------------------------------------------------------------------------
r45193 | elephantgcc | 2013-06-20 18:44:57 +0800 (Thu, 20 Jun 2013) | 1 line

MOD: bugfix, avoid '-nan' parameters.
------------------------------------------------------------------------
r45191 | elephantgcc | 2013-06-20 17:07:05 +0800 (Thu, 20 Jun 2013) | 1 line

MOD: add retrain() for sw tagger.
------------------------------------------------------------------------
r45189 | elephantgcc | 2013-06-20 16:42:30 +0800 (Thu, 20 Jun 2013) | 1 line

MOD: bugfix, use heap space for 3-dimensional parameters.
------------------------------------------------------------------------
r45188 | elephantgcc | 2013-06-20 16:30:03 +0800 (Thu, 20 Jun 2013) | 1 line

MOD: add support for null_flush in sw tagger.
------------------------------------------------------------------------
r45176 | elephantgcc | 2013-06-19 19:46:29 +0800 (Wed, 19 Jun 2013) | 1 line

MOD: done with EOS, refine training's and tagging's reading control.
------------------------------------------------------------------------
r45161 | elephantgcc | 2013-06-18 22:23:26 +0800 (Tue, 18 Jun 2013) | 1 line

MOD: add switch for debug, eos, null_flush.
------------------------------------------------------------------------
r45159 | elephantgcc | 2013-06-18 21:59:48 +0800 (Tue, 18 Jun 2013) | 1 line

MOD: add print_para_matrix() for debugging in sw tagger.
------------------------------------------------------------------------
r45116 | elephantgcc | 2013-06-17 19:09:17 +0800 (Mon, 17 Jun 2013) | 1 line

MOD: let sw tagger end when the input word is NULL.
------------------------------------------------------------------------
r45031 | elephantgcc | 2013-06-13 10:59:53 +0800 (Thu, 13 Jun 2013) | 1 line

MOD: change initial tag score from 0 to -1.
------------------------------------------------------------------------
r45029 | elephantgcc | 2013-06-12 21:52:44 +0800 (Wed, 12 Jun 2013) | 1 line

MOD: add option for sw tagger.
------------------------------------------------------------------------
r45017 | elephantgcc | 2013-06-11 21:15:01 +0800 (Tue, 11 Jun 2013) | 1 line

MOD: add judgement 'morpho_stream.getEndOfFile()', first demo(training and tagging) OK.
------------------------------------------------------------------------
r45016 | elephantgcc | 2013-06-11 20:27:46 +0800 (Tue, 11 Jun 2013) | 1 line

MOD: clean up debug info in morpho_stream.cc
------------------------------------------------------------------------
r45015 | elephantgcc | 2013-06-11 20:17:34 +0800 (Tue, 11 Jun 2013) | 1 line

MOD: clean up debug info in hmm.cc
------------------------------------------------------------------------
r45010 | elephantgcc | 2013-06-11 17:44:56 +0800 (Tue, 11 Jun 2013) | 1 line

MOD: clean up for matrix c
------------------------------------------------------------------------
r45009 | elephantgcc | 2013-06-11 17:30:09 +0800 (Tue, 11 Jun 2013) | 1 line

MOD: bugfix for strange M and N in tagger.cc, should use 'td.readSWPoST' but not 'td.read'
------------------------------------------------------------------------
r44984 | elephantgcc | 2013-06-10 16:07:48 +0800 (Mon, 10 Jun 2013) | 1 line

MOD: training basic version, tagging basic version with bug.
------------------------------------------------------------------------
r44809 | elephantgcc | 2013-05-30 14:03:22 +0800 (Thu, 30 May 2013) | 1 line

COPY: copy trunk/apertium to branches/apertium-swpost/apertium
------------------------------------------------------------------------

Revision as of 10:04, 1 July 2013

GSoC 2013

I'm working with Apertium for GSoC 2013 on the project "Sliding Window Part of Speech Tagger for Apertium".

My proposal is here: Proposal

svn repo

https://svn.code.sf.net/p/apertium/svn/branches/apertium-swpost/apertium

General Progress

2013-06-24: LSW tagger working, with rules.

2013-06-20: LSW tagger working, without rules.

2013-06-11: SW tagger working.

2013-05-30: Start.

Detailed progress


2013-06-28

1. Add find_similar_ambiguity_class to the tagging procedure, so that an ambiguity class not seen before no longer crashes the tagger.

2. Fix a bug in the normalization factor. This makes the computation correct, but tagging quality does not improve.
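
One plausible way to handle an unseen ambiguity class is to fall back to the largest known class that is a subset of it. The sketch below only illustrates that idea: the function name comes from the entry above, but the signature, the `TagSet` type and the subset heuristic are assumptions, not the code in the branch.

```cpp
#include <algorithm>
#include <cstddef>
#include <set>
#include <vector>

// Assumption: an ambiguity class is modelled as a set of tag indices.
typedef std::set<int> TagSet;

// Return the index of the known class that is the largest subset of `unseen`,
// or -1 if no known class is a subset. Ties keep the first candidate.
int find_similar_ambiguity_class(const std::vector<TagSet>& known,
                                 const TagSet& unseen) {
  int best = -1;
  std::size_t best_size = 0;
  for (std::size_t i = 0; i < known.size(); ++i) {
    const TagSet& c = known[i];
    // std::includes needs sorted ranges; std::set iterates in sorted order.
    bool subset = std::includes(unseen.begin(), unseen.end(),
                                c.begin(), c.end());
    if (subset && (best == -1 || c.size() > best_size)) {
      best = static_cast<int>(i);
      best_size = c.size();
    }
  }
  return best;
}
```

With known classes {1}, {1,2} and {3,4}, the unseen class {1,2,5} maps to index 1, and the tagger can reuse the parameters of that class instead of failing.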


2013-06-23

1. Implement a light sliding-window (LSW) tagger. It is based on the SW tagger, with the "parameter reduction" described in the 2005 paper.

2. Add rule support for the light-sw tagger. The rules help to improve the tagging quality.


2013-06-21

1. Add a ZERO define. Because the code compares doubles in several places, it needs a reasonably precise threshold.

2. Fix bugs in the initialization procedure and the iteration formula.

3. Check function style so that the new code is consistent with the existing code.
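
A minimal sketch of how such a threshold is typically used. The value `1e-10` and the helper names `is_zero` and `safe_divide` are assumptions for illustration; the log does not state the actual value or the helpers used in the branch.

```cpp
#include <cmath>

// Threshold below which a double is treated as zero (assumed value).
#define ZERO 1e-10

// True if x is indistinguishable from zero at the chosen precision.
inline bool is_zero(double x) {
  return std::fabs(x) < ZERO;
}

// Divide only when the denominator is safely non-zero; otherwise return 0.
// A guard like this keeps 0/0 from turning a parameter into NaN.
inline double safe_divide(double num, double den) {
  return is_zero(den) ? 0.0 : num / den;
}
```

Comparing against a threshold instead of `== 0.0` avoids treating tiny rounding residue as a real parameter value.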


2013-06-20

1. Use heap space for the 3-dimensional parameters. Training now succeeds without manually adjusting the 'stack' setting of the environment.

2. Add a retrain() function for the SW tagger. Its logic is the same as that of the HMM tagger's retrain: it appends several iterations starting from the current parameters.

3. Bugfix: avoid '-nan' parameters.

4. tagger_data now writes only non-ZERO values. This saves a lot of disk space, reducing the parameter file from 100M to 200k.


2013-06-19

1. Add print_para_matrix() for debugging in the SW tagger. This function prints only the non-ZERO parameters in the 3-d matrix.

2. Add support for debug, EOS, and null_flush. This makes the tagger work stably when called by other programs.


2013-06-17

1. Dig into the morpho_stream class to make sense of its members and functions.

2. Refine the reading procedure of the SW tagger, so that the procedure is simpler and more stable.


2013-06-13

1. Fix a bug in the tagging procedure: the initial tag score should be -1 instead of 0.
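
One reading of this fix: if the best score starts at 0 and candidates are accepted only when strictly greater, a word whose candidate tags all score 0 ends up with no tag at all. The sketch below shows the pattern with the -1 start value; the function name and signature are assumptions, and scores are assumed to be non-negative.

```cpp
#include <cstddef>
#include <vector>

// Return the index of the highest-scoring tag, or -1 if `scores` is empty.
// Scores are assumed to be non-negative (probabilities or counts).
int best_tag(const std::vector<double>& scores) {
  int best = -1;
  double best_score = -1.0;  // below any valid score, so a score of 0 can win
  for (std::size_t i = 0; i < scores.size(); ++i) {
    if (scores[i] > best_score) {
      best_score = scores[i];
      best = static_cast<int>(i);
    }
  }
  return best;
}
```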


2013-06-12

1. Add option "-w" for the SW tagger. The "-w" option is an extension, not a drop-in replacement for the current HMM tagger: the HMM tagger remains the default tagger for Apertium, and the SW tagger is used only when "-w" is specified.

   HMM tagger usage:
   apertium-tagger -t 8 es.dic  es.crp  apertium-en-es.es.tsx  es-en.prob
   apertium-tagger -g es-en.prob.new

   SW tagger usage:
   apertium-tagger -w -t 8 es.dic  es.crp  apertium-en-es.es.tsx  es-en.prob
   apertium-tagger -w -g es-en.prob.new

2. Add support for detecting the end of the morpho_stream.

3. The first working version is OK:)


2013-06-11

1. Fix the bug in the last version, which was mainly caused by the read() method in the 'TSX reader'. The read and write methods in tsx_reader and tagger_data are re-implemented, because they differ from those for the HMM.

2. Implement the compression part of the SW tagger probabilities. These parameters are stored in a 3-d array.

3. Do some debugging on the HMM tagger, in order to see how a tagger should work together with the whole pipeline.


2013-06-10

1. Implement a basic version of the SW tagger, covering training and tagging. There are still bugs between the two.

2. The training and tagging procedures strictly follow the 2004 paper.
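
As a rough illustration of the tagging step, assuming a window of one word on each side: the tag of the current word is picked from its own candidate tags using the ambiguity classes of its left and right neighbours as context. The `score[left][tag][right]` layout, the type and the function name are assumptions made for this sketch, not the data structures of the branch.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical parameter table indexed as score[left class][tag][right class].
typedef std::vector<std::vector<std::vector<double> > > Score3D;

// Among `candidates` (the tags in the current word's ambiguity class), return
// the tag with the highest score for the given left and right context classes.
// Returns -1 if there are no candidates. Scores are assumed non-negative.
int sw_select_tag(const Score3D& score, std::size_t left_class,
                  const std::vector<int>& candidates,
                  std::size_t right_class) {
  int best = -1;
  double best_score = -1.0;
  for (std::size_t i = 0; i < candidates.size(); ++i) {
    int tag = candidates[i];
    double s = score[left_class][static_cast<std::size_t>(tag)][right_class];
    if (s > best_score) {
      best_score = s;
      best = tag;
    }
  }
  return best;
}
```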


2013-05-30

1. Start the project.