Difference between revisions of "User:Gang Chen"

From Apertium
=== Wikipedia extractor ===

http://wiki.apertium.org/wiki/User:Gang_Chen/Wikipedia_Extractor
==== target ====
 
 
This tool extracts the main text from a Wikipedia dump, producing a plain-text corpus that is useful for part-of-speech tagging, language model training, and similar tasks.
 
 
==== tool ====
 
http://code.google.com/p/natural-language-qa/source/browse/MakeCorpus/WikiExtractor.py
 
 
==== license ====
 
GPLv3
 
 
==== usage ====
 
 
===== 1. Get the script =====
 
 
http://code.google.com/p/natural-language-qa/source/browse/MakeCorpus/WikiExtractor.py
 
 
===== 2. Download the Wikipedia dump file here =====
 
 
http://dumps.wikimedia.org/backup-index.html
 
 
Taking Chinese as an example, download the file '''zhwiki-20130625-pages-articles.xml.bz2''' from http://dumps.wikimedia.org/zhwiki/20130625/. Alternatively, download the latest version from http://dumps.wikimedia.org/zhwiki/latest/
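The file name follows a regular pattern, so the download can also be scripted. The following is a minimal Python sketch, assuming the URL layout of the Chinese example (language code plus "wiki", then the dump date) also holds for other languages and dates. The helper names <code>dump_url</code> and <code>download_dump</code> are made up for illustration.

```python
import shutil
import urllib.request


def dump_url(lang, date):
    """Build the pages-articles dump URL, assuming the layout of the zhwiki example."""
    name = "{0}wiki-{1}-pages-articles.xml.bz2".format(lang, date)
    return "http://dumps.wikimedia.org/{0}wiki/{1}/{2}".format(lang, date, name)


def download_dump(lang, date, target=None):
    """Download the dump to `target` (default: the file name taken from the URL)."""
    url = dump_url(lang, date)
    target = target or url.rsplit("/", 1)[1]
    # Stream to disk in chunks; the dumps are far too large to hold in memory.
    with urllib.request.urlopen(url) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)
    return target


if __name__ == "__main__":
    # Only prints the URL; call download_dump("zh", "20130625") to fetch the file.
    print(dump_url("zh", "20130625"))
```

Old dumps are eventually removed from the server, so a dated URL such as the one above may stop working over time.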
 
 
===== 3. Use the script =====
 
<pre>
 
mkdir output

bzcat zhwiki-20130625-pages-articles.xml.bz2 | python WikiExtractor.py -o output
 
 
</pre>
 
 
Optionally, we can use "-c" to compress the output files and save disk space, and "-b" to set the number of bytes per output file.

For more information, run "python WikiExtractor.py --help".
 
 
OK, let's have a cup of tea and come back in an hour. The output should be in files named '''output/AA/wikiXX''', where each wikiXX file contains part of the extracted text.
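Before moving on, it can help to check what was produced. Here is a small Python sketch, assuming the layout described above (subdirectories such as AA containing files whose names start with "wiki"). <code>list_extracted</code> is a hypothetical helper and not part of WikiExtractor.

```python
import os


def list_extracted(output_dir):
    """Return (path, size in bytes) for every wiki* file under output_dir, sorted by path."""
    found = []
    for root, _dirs, files in os.walk(output_dir):
        for name in files:
            if name.startswith("wiki"):
                path = os.path.join(root, name)
                found.append((path, os.path.getsize(path)))
    return sorted(found)


if __name__ == "__main__":
    files = list_extracted("output")
    total = sum(size for _path, size in files)
    print("{0} files, {1} bytes in total".format(len(files), total))
```

If the file count is zero, the extraction most likely did not run or wrote to a different directory.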
 
 
 
===== 4. Clean up "<>" tags =====
 
 
We are only one step away from the final text corpus: the wikiXX files still contain links and other "<>" tags. Let's use the following tiny script, which reads text from standard input and writes it to standard output, to filter them out.
 
 
<pre>
 
 
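#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Strip "<...>" tags from text read on stdin and write the result to stdout.

import re
import sys

# Non-greedy match, so each tag on a line is removed separately.
regex = re.compile(r"<.*?>")

for line in sys.stdin:
    # Each line keeps its own newline, so write() is used instead of print.
    sys.stdout.write(regex.sub("", line))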
 
 
</pre>
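To apply the same filter to every extracted file at once, something like the following Python sketch could be used. It assumes the output/AA/wikiXX layout from step 3, uncompressed UTF-8 output (no "-c"), and the same "<.*?>" pattern as the script above. <code>strip_tags</code> and <code>build_corpus</code> are illustrative names.

```python
import io
import os
import re

TAG = re.compile(r"<.*?>")  # same non-greedy pattern as the filter script


def strip_tags(text):
    """Remove every "<...>" span from a string."""
    return TAG.sub("", text)


def build_corpus(output_dir, corpus_path):
    """Concatenate all wiki* files under output_dir into corpus_path, tags removed."""
    paths = []
    for root, _dirs, files in os.walk(output_dir):
        paths.extend(os.path.join(root, n) for n in files if n.startswith("wiki"))
    with io.open(corpus_path, "w", encoding="utf-8") as out:
        for path in sorted(paths):
            with io.open(path, encoding="utf-8") as src:
                for line in src:
                    out.write(strip_tags(line))
    # Number of input files that were merged into the corpus.
    return len(paths)
```

For example, <code>build_corpus("output", "corpus.txt")</code> would write the cleaned text to corpus.txt. The pattern works line by line, so it only removes tags that open and close on the same line, and it would also remove ordinary text that happens to sit between a "<" and a later ">" on one line.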
 
 
===== 5. Done :) =====
 

Revision as of 09:16, 12 July 2013

About me

Name: Gang Chen

Email: pkuchengang@gmail.com

IRC: Gang

SourceForge: elephantgcc

GitHub Repo: https://github.com/elephantgcc

GSOC 2013

I'm working with Apertium for GSoC 2013 on the project "Sliding Window Part of Speech Tagger for Apertium".

My proposal is here: Proposal

SVN repo

https://svn.code.sf.net/p/apertium/svn/branches/apertium-swpost/apertium

LSW tagger: Current Progress

http://wiki.apertium.org/w/index.php?title=User:Gang_Chen/GSoC_2013_Progress

Useful tools

Wikipedia extractor

http://wiki.apertium.org/wiki/User:Gang_Chen/Wikipedia_Extractor