Difference between revisions of "User:Gang Chen"
Revision as of 08:52, 12 July 2013
About me
Name: Gang Chen
Email: pkuchengang@gmail.com
IRC: Gang
SourceForge: elephantgcc
GitHub Repo: https://github.com/elephantgcc
GSoC 2013
I'm working with Apertium for GSoC 2013, on the project "Sliding Window Part of Speech Tagger for Apertium".
My proposal is here: Proposal
SVN repo
https://svn.code.sf.net/p/apertium/svn/branches/apertium-swpost/apertium
LSW tagger: Current Progress
http://wiki.apertium.org/w/index.php?title=User:Gang_Chen/GSoC_2013_Progress
Useful tools
Wikipedia extractor
tool
http://code.google.com/p/natural-language-qa/source/browse/MakeCorpus/WikiExtractor.py
target
This tool extracts the main text from a Wikipedia dump, producing a plain-text corpus that is useful for part-of-speech tagging, language-model training, etc.
usage
1. Get the script.
http://code.google.com/p/natural-language-qa/source/browse/MakeCorpus/WikiExtractor.py
2. Download the Wikipedia dump file from http://dumps.wikimedia.org/backup-index.html
Let's take Chinese as an example: download the file zhwiki-20130625-pages-articles.xml.bz2 from this page: http://dumps.wikimedia.org/zhwiki/20130625/
Alternatively, we can download the latest version from this page: http://dumps.wikimedia.org/zhwiki/latest/
3. Use the script.
4. Clean up the remaining "<>" tags in the output.
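Steps 3 and 4 can be sketched as follows. The exact command-line flags of WikiExtractor.py and the <doc ...> wrapper format of its output are assumptions based on typical versions of the script, so check `python WikiExtractor.py --help` and a sample of your own output before relying on them:

```python
import re

# Step 3 (shell, illustrative only -- flags may differ between versions):
#   bzcat zhwiki-20130625-pages-articles.xml.bz2 | python WikiExtractor.py -o extracted

# Step 4: WikiExtractor typically wraps each article in <doc ...> ... </doc>
# tags. Strip them to get plain text suitable for tagger or LM training.
DOC_TAG = re.compile(r"</?doc[^>]*>")

def clean_extracted(text: str) -> str:
    """Remove <doc> wrapper tags and drop blank lines."""
    no_tags = DOC_TAG.sub("", text)
    lines = [line for line in no_tags.splitlines() if line.strip()]
    return "\n".join(lines)

sample = '<doc id="1" url="..." title="Example">\nSome article text.\n</doc>\n'
print(clean_extracted(sample))  # -> Some article text.
```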