User:Gang Chen

About me

Name: Gang Chen

Email: pkuchengang@gmail.com

IRC: Gang

SourceForge: elephantgcc

GitHub Repo: https://github.com/elephantgcc

I'm working with Apertium for GSoC 2013 on the project "Sliding Window Part of Speech Tagger for Apertium".

My proposal is here: Proposal

This tool extracts the main text from Wikipedia and produces a text corpus, which is useful for part-of-speech tagging, language model training, and similar tasks.

1. Get the script.

2. Download the Wikipedia dump file.

  Let's take Chinese as an example: download the file zhwiki-20130625-pages-articles.xml.bz2 from this page: http://dumps.wikimedia.org/zhwiki/20130625/

  Alternatively, we can download the latest version from this page: http://dumps.wikimedia.org/zhwiki/latest/
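
  As a rough sketch (not part of the extraction script itself), the dump file name can be built from the wiki name and the dump date, and the downloaded .bz2 file can be read line by line without unpacking it first. The helper names dump_url and iter_dump_lines are hypothetical, and the URL pattern is only assumed from the file name given above.

```python
import bz2


def dump_url(wiki="zhwiki", date="20130625"):
    """Build a pages-articles dump URL.

    Assumption: the naming scheme is inferred from the example file
    zhwiki-20130625-pages-articles.xml.bz2 mentioned above.
    """
    return (
        f"http://dumps.wikimedia.org/{wiki}/{date}/"
        f"{wiki}-{date}-pages-articles.xml.bz2"
    )


def iter_dump_lines(path):
    """Yield decoded text lines from a local .bz2 dump, decompressing on the fly."""
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            yield line.rstrip("\n")
```

  Reading the compressed file as a stream avoids writing the much larger uncompressed XML to disk.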

3. Use the script.

4. Clean up "<>" tags.
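
  One possible way to do this cleanup, shown only as a minimal sketch: a regular expression that deletes every "<...>" span from the extracted text. The function name strip_tags is hypothetical, and the exact tags left in the output depend on the script, so the pattern may need adjusting.

```python
import re

# Matches a "<", any run of characters that are not angle brackets, then ">".
TAG_PATTERN = re.compile(r"<[^<>]*>")


def strip_tags(text):
    """Remove every "<...>" span, then collapse leftover runs of spaces or tabs."""
    without_tags = TAG_PATTERN.sub("", text)
    return re.sub(r"[ \t]+", " ", without_tags).strip()
```

  For example, strip_tags('<doc id="1">Hello <b>world</b></doc>') gives 'Hello world'. Note that this simple pattern would also delete ordinary text that happens to sit between a literal "<" and a later ">", so it is best applied only to lines known to contain markup.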