
== About me ==

Name: Gang Chen

Email: pkuchengang@gmail.com

IRC: Gang

SourceForge: elephantgcc

GitHub Repo: https://github.com/elephantgcc

== GSoC 2013 ==

I'm working with Apertium for GSoC 2013, on the project "Sliding Window Part of Speech Tagger for Apertium".

My proposal is here: Proposal

=== SVN repo ===

https://svn.code.sf.net/p/apertium/svn/branches/apertium-swpost/apertium

=== LSW tagger: Current Progress ===

http://wiki.apertium.org/w/index.php?title=User:Gang_Chen/GSoC_2013_Progress

== Useful tools ==

=== Wikipedia extractor ===

==== Tool ====

http://code.google.com/p/natural-language-qa/source/browse/MakeCorpus/WikiExtractor.py

==== Purpose ====

This tool extracts the main text from a Wikipedia dump, producing a plain-text corpus that is useful for part-of-speech tagging, language model training, etc.

==== Usage ====

1. Get the script.

http://code.google.com/p/natural-language-qa/source/browse/MakeCorpus/WikiExtractor.py

2. Download the Wikipedia dump file from http://dumps.wikimedia.org/backup-index.html

Let's take Chinese as an example: download the file '''zhwiki-20130625-pages-articles.xml.bz2''' from the page http://dumps.wikimedia.org/zhwiki/20130625/

Alternatively, we can download the latest version from http://dumps.wikimedia.org/zhwiki/latest/
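
For example, the dump can be fetched from the command line; the URL below simply joins the dump page and the file name given above:

<pre>
# Download the 2013-06-25 Chinese Wikipedia articles dump.
wget http://dumps.wikimedia.org/zhwiki/20130625/zhwiki-20130625-pages-articles.xml.bz2
</pre>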

3. Use the script.
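
A typical invocation pipes the decompressed dump into the script and writes the extracted text to an output directory. This is a sketch: the -o option is assumed from common versions of WikiExtractor.py, so run "python WikiExtractor.py --help" first to confirm the options your copy supports.

<pre>
# Decompress the dump on the fly and feed it to the extractor;
# the extracted plain text is written to files under extracted/.
bzcat zhwiki-20130625-pages-articles.xml.bz2 | python WikiExtractor.py -o extracted
</pre>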


4. Clean up the "<doc>" wrapper tags in the output.
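
The extractor wraps each article in "<doc ...>" and "</doc>" lines, so a simple filter is enough to obtain a raw text corpus. This is a sketch, assuming the extracted/AA/wiki_00-style output layout of the previous step:

<pre>
# Concatenate the extracted files and drop the <doc> wrapper lines,
# keeping only the article text.
cat extracted/*/wiki_* | grep -v '^<' > zhwiki-corpus.txt
</pre>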