User:Gang Chen

From Apertium
Revision as of 09:07, 12 July 2013

About me

Name: Gang Chen

Email: pkuchengang@gmail.com

IRC: Gang

SourceForge: elephantgcc

GitHub Repo: https://github.com/elephantgcc

GSoC 2013

I'm working with Apertium for GSoC 2013, on the project "Sliding Window Part of Speech Tagger for Apertium".

My proposal is here: Proposal

svn repo

https://svn.code.sf.net/p/apertium/svn/branches/apertium-swpost/apertium

LSW tagger: Current Progress

http://wiki.apertium.org/w/index.php?title=User:Gang_Chen/GSoC_2013_Progress

Useful tools

Wikipedia extractor

target

This tool extracts the main text from a Wikipedia dump, producing a plain-text corpus that is useful for part-of-speech tagging, language model training, etc.

tool

http://code.google.com/p/natural-language-qa/source/browse/MakeCorpus/WikiExtractor.py

usage

1. Get the script.

  http://code.google.com/p/natural-language-qa/source/browse/MakeCorpus/WikiExtractor.py

2. Download the Wikipedia dump file from http://dumps.wikimedia.org/backup-index.html

  Let's take Chinese as an example: download the file zhwiki-20130625-pages-articles.xml.bz2 from this page: http://dumps.wikimedia.org/zhwiki/20130625/
  Alternatively, we can download the latest version from this page: http://dumps.wikimedia.org/zhwiki/latest/
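The download in step 2 can also be scripted. A minimal sketch, assuming wget is available; the variable names are this example's own, and the URL is built from the example date and language code above:

```shell
# Build the dump URL from the language code and dump date used in the example.
DATE=20130625
LANG_CODE=zh
DUMP="${LANG_CODE}wiki-${DATE}-pages-articles.xml.bz2"
URL="http://dumps.wikimedia.org/${LANG_CODE}wiki/${DATE}/${DUMP}"
echo "$URL"
# wget "$URL"   # uncomment to actually download; large wikis are several GB
```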

3. Use the script.


mkdir output
bzcat zhwiki-20130625-pages-articles.xml.bz2 | ./WikiExtractor.py -o output

Optionally, we can use "-c" to compress the output files and save disk space, and "-b" to set the number of bytes per output file. For more information, type "./WikiExtractor.py --help".

   OK, let's have a cup of tea and come back an hour later.
   The output will be under output/AA/wikiXX, where the wikiXX files contain the extracted text.

4. Clean up "<>" tags

   We are one step away from the final text corpus, because there are still "<>" tags in the wikiXX files. Let's use the following tiny script to filter them out.

#!/usr/bin/python
# -*- coding: utf-8 -*-
# Read text from stdin, strip "<...>" tags, and print the cleaned lines.
import sys
import re

regex = re.compile(ur"<.*?>")
line = sys.stdin.readline()
while line != "":
	line = regex.sub("", line).rstrip("\n")
	print line
	line = sys.stdin.readline()
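To sanity-check the tag-stripping regex, here is a small standalone example; the sample line is a hypothetical WikiExtractor-style output line invented for illustration, not taken from a real dump:

```python
# Demonstrate the "<.*?>" tag-stripping regex on a hypothetical sample line.
import re

regex = re.compile(r"<.*?>")
sample = '<doc id="1" url="http://example.org">Beijing is the capital of China.</doc>'
cleaned = regex.sub("", sample)
print(cleaned)  # -> Beijing is the capital of China.
```

The non-greedy `.*?` matters: a greedy `<.*>` would match from the first `<` to the last `>` and delete the whole line.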

5. Done :)