== About me ==

* Name: Gang Chen
* Email: pkuchengang@gmail.com
* IRC: Gang
* SourceForge: elephantgcc
* GitHub Repo: https://github.com/elephantgcc
== GSoC 2013 ==

I'm working with Apertium for GSoC 2013, on the project "Sliding Window Part of Speech Tagger for Apertium".

My proposal is here: Proposal

=== SVN repo ===

https://svn.code.sf.net/p/apertium/svn/branches/apertium-swpost/apertium

=== LSW tagger: current progress ===

http://wiki.apertium.org/w/index.php?title=User:Gang_Chen/GSoC_2013_Progress
== Useful tools ==

=== Wikipedia extractor ===

==== Target ====
This tool extracts the main text from a Wikipedia dump, producing a plain-text corpus that is useful for part-of-speech tagging, language model training, etc.
==== Tool ====
http://code.google.com/p/natural-language-qa/source/browse/MakeCorpus/WikiExtractor.py
==== Usage ====
1. Get the script.
http://code.google.com/p/natural-language-qa/source/browse/MakeCorpus/WikiExtractor.py
2. Download the Wikipedia dump file from http://dumps.wikimedia.org/backup-index.html
Let's take Chinese as an example: download the file zhwiki-20130625-pages-articles.xml.bz2 from the page http://dumps.wikimedia.org/zhwiki/20130625/
Alternatively, we can download the latest dump from http://dumps.wikimedia.org/zhwiki/latest/
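If you prefer to script the download, here is a minimal sketch in Python 2 (to match the cleanup script below). The exact file URL is an assumption based on the usual dump layout, so check the real file name on the dump page first.

<pre>
#! /usr/bin/python
# Minimal download sketch. The URL below assumes the usual
# dumps.wikimedia.org layout -- verify the file name on the dump page.
import urllib

url = ("http://dumps.wikimedia.org/zhwiki/20130625/"
       "zhwiki-20130625-pages-articles.xml.bz2")
urllib.urlretrieve(url, "zhwiki-20130625-pages-articles.xml.bz2")
</pre>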
3. Use the script.
<pre>
mkdir output
bzcat zhwiki-20130625-pages-articles.xml.bz2 | python WikiExtractor.py -o output
</pre>

Optionally, we can use "-c" to compress the output files and save disk space, and "-b" to set the maximum number of bytes per output file. For more information, run "python WikiExtractor.py --help".
Ok, let's have a cup of tea and come back an hour later.
The output will be in output/AA/wikiXX, where the wikiXX files contain the extracted text.
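To get a rough idea of what came out, here is a small sketch that counts the extracted articles. It assumes the "&lt;doc ...&gt;" wrapper that WikiExtractor writes at the start of each article; peek at one wikiXX file to confirm the format.

<pre>
#! /usr/bin/python
# Count extracted articles by looking for the "<doc ...>" marker that
# WikiExtractor writes before each article (an assumption -- check one
# wikiXX file first).
import glob

total = 0
for path in glob.glob("output/*/wiki*"):
    with open(path) as f:
        total += sum(1 for line in f if line.startswith("<doc"))
print "%d articles extracted" % total
</pre>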
4. Clean up "<>" tags

We are one step away from the final text corpus, because there are still links and other "<>" tags in the wikiXX files. Let's use the following tiny script to filter them out.
<pre>
#! /usr/bin/python
# -*- coding:utf-8 -*-
# Read text on stdin, strip anything that looks like a "<...>" tag,
# and write the cleaned lines to stdout.
import sys
import re

regex = re.compile(ur"<.*?>")

line = sys.stdin.readline()
while line != "":
    # Remove tags and the trailing newline before printing.
    line = regex.sub("", line).rstrip("\n")
    print line
    line = sys.stdin.readline()
</pre>
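To apply it, save the script as, say, clean_tags.py (the name is just a suggestion) and pipe each extracted file through it, for example: cat output/AA/wiki00 | python clean_tags.py >> corpus.txt, where corpus.txt is the final corpus file.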
5. Done :)
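As a quick sanity check on the finished corpus (and a toy version of the "language model training" use mentioned above), here is a small sketch that counts word unigrams. "corpus.txt" is just the hypothetical name used in the usage note above, and for Chinese you would normally run a word segmenter first, since splitting on whitespace only makes sense for space-delimited languages.

<pre>
#! /usr/bin/python
# -*- coding:utf-8 -*-
# Toy unigram counter over the cleaned corpus. "corpus.txt" is a
# hypothetical name; for Chinese, segment the text into words first.
from collections import Counter

counts = Counter()
with open("corpus.txt") as f:
    for line in f:
        counts.update(line.split())

# Show the 20 most frequent tokens.
for token, freq in counts.most_common(20):
    print token, freq
</pre>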