Difference between revisions of "User:Gang Chen"
Line 28: | Line 28: | ||
=== Wikipedia extractor === |
=== Wikipedia extractor === |
||
+ | http://wiki.apertium.org/wiki/User:Gang_Chen/Wikipedia_Extractor |
||
− | ==== target ==== |
||
− | |||
− | This tool extracts the main text from Wikipedia to produce a plain-text corpus, which is useful for part-of-speech tagging, language model training, etc. |
||
− | |||
− | ==== tool ==== |
||
− | http://code.google.com/p/natural-language-qa/source/browse/MakeCorpus/WikiExtractor.py |
||
− | |||
− | ==== license ==== |
||
− | GPLv3 |
||
− | |||
− | ==== usage ==== |
||
− | |||
− | ===== 1. Get the script ===== |
||
− | |||
− | http://code.google.com/p/natural-language-qa/source/browse/MakeCorpus/WikiExtractor.py |
||
− | |||
− | ===== 2. Download the Wikipedia dump file here ===== |
||
− | |||
− | http://dumps.wikimedia.org/backup-index.html |
||
− | |||
− | Taking Chinese as an example, download the file '''zhwiki-20130625-pages-articles.xml.bz2''' from http://dumps.wikimedia.org/zhwiki/20130625/. Alternatively, download the latest version from http://dumps.wikimedia.org/zhwiki/latest/ |
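The dump file name in this step follows a regular pattern, so the URL can be built from a language code and a dump date. The sketch below is only an illustration: the helper name `dump_url` is hypothetical, and it assumes that other languages and the "latest" directory use the same naming pattern as the Chinese example.

```python
def dump_url(lang, date="latest"):
    """Build the URL of a pages-articles dump (hypothetical helper).

    Assumes the naming pattern of the zhwiki example also holds
    for other language codes and for the "latest" directory.
    """
    name = "{0}wiki-{1}-pages-articles.xml.bz2".format(lang, date)
    return "http://dumps.wikimedia.org/{0}wiki/{1}/{2}".format(lang, date, name)


# The Chinese dump used as the example in this step.
print(dump_url("zh", "20130625"))
```

Passing only the language code, as in `dump_url("zh")`, points at the "latest" directory instead of a dated one.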
||
− | |||
− | ===== 3. Use the script ===== |
||
− | <pre> |
||
− | mkdir output |
||
− | |||
− | bzcat zhwiki-20130625-pages-articles.xml.bz2 | ./WikiExtractor.py -o output |
||
− |||
− | </pre> |
||
− |||
− | Optionally, we can use |
||
− |||
− | "-c" to compress the output files and save disk space, and |
||
− |||
− | "-b" to set the number of bytes per output file. |
||
− |||
− | For more information, run "./WikiExtractor.py --help". |
||
− | |||
− | OK, let's have a cup of tea and come back in an hour. The output should be in '''output/AA/wikiXX''', where the wikiXX files contain the extracted text. |
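Where `bzcat` is not available, the decompression half of the pipeline in this step can be approximated with Python's standard `bz2` module. This is a minimal sketch, not part of the extractor itself; the function name `stream_dump` is hypothetical.

```python
import bz2
import shutil


def stream_dump(dump_path, sink):
    """Decompress a .bz2 dump into a binary file-like sink, chunk by chunk.

    The dump is never held in memory as a whole, which matters
    because Wikipedia dumps are large.
    """
    with bz2.open(dump_path, "rb") as source:
        shutil.copyfileobj(source, sink)


# Hypothetical usage, writing to standard output like bzcat does:
#   import sys
#   stream_dump("zhwiki-20130625-pages-articles.xml.bz2", sys.stdout.buffer)
```

Writing to `sys.stdout.buffer` keeps the data as raw bytes, so the result can be piped into the extractor in the same way as the `bzcat` output.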
||
− | |||
− | |||
− | ===== 4. Clean up "<>" tags ===== |
||
− |||
− | We are only one step away from the final text corpus: the wikiXX files still contain "<>" tags, such as links. The following tiny script filters them out. |
||
− | |||
− | <pre> |
||
− | |||
− | #!/usr/bin/env python3 |
||
− | # -*- coding: utf-8 -*- |
||
− | # Strip "<...>" tags from every line read on standard input. |
||
− | import sys |
||
− | import re |
||
− | regex = re.compile(r"<.*?>")  # non-greedy: matches one tag at a time |
||
− | for line in sys.stdin: |
||
− |     # Drop the trailing newline; print() adds it back. |
||
− |     print(regex.sub("", line.rstrip("\n"))) |
||
− | |||
− | </pre> |
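The clean-up step can also be applied to the whole output directory at once rather than through standard input. The following is a rough sketch under a few assumptions: the extractor was run without "-c" (so the files are plain text), the files are UTF-8 encoded, and they match the `output/*/wiki*` layout described above. The names `strip_tags` and `build_corpus` are hypothetical.

```python
import re
from pathlib import Path

TAG = re.compile(r"<.*?>")  # same non-greedy pattern as the filter script


def strip_tags(text):
    """Remove every "<...>" tag from a string."""
    return TAG.sub("", text)


def build_corpus(output_dir, corpus_path):
    """Concatenate all <output_dir>/*/wiki* files into one tag-free file.

    Returns the number of extracted files that were processed.
    Assumes uncompressed, UTF-8 encoded extractor output.
    """
    count = 0
    with open(corpus_path, "w", encoding="utf-8") as corpus:
        for path in sorted(Path(output_dir).glob("*/wiki*")):
            with open(path, encoding="utf-8") as extracted:
                for line in extracted:
                    corpus.write(strip_tags(line))
            count += 1
    return count
```

A call such as `build_corpus("output", "corpus.txt")` would then write the cleaned text of every extracted file into a single corpus file; lines that consisted only of a tag are left behind as empty lines.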
||
− | |||
− | ===== 5. Done :) =====
Revision as of 09:16, 12 July 2013
About me
Name: Gang Chen
Email: pkuchengang@gmail.com
IRC: Gang
SourceForge: elephantgcc
GitHub Repo: https://github.com/elephantgcc
GSOC 2013
I'm working with Apertium for GSoC 2013 on the project "Sliding Window Part of Speech Tagger for Apertium".
My proposal is here: Proposal
svn repo
https://svn.code.sf.net/p/apertium/svn/branches/apertium-swpost/apertium
LSW tagger: Current Progress
http://wiki.apertium.org/w/index.php?title=User:Gang_Chen/GSoC_2013_Progress
Useful tools
Wikipedia extractor
http://wiki.apertium.org/wiki/User:Gang_Chen/Wikipedia_Extractor