 
== About me ==
 
 
'''Name:''' Gang Chen

'''Email:''' pkuchengang@gmail.com

'''IRC:''' Gang

'''SourceForge:''' elephantgcc

'''GitHub Repo:''' https://github.com/elephantgcc
 
== GSoC 2013 ==

I'm working with Apertium for GSoC 2013, on the project "Sliding Window Part of Speech Tagger for Apertium".
 
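The rough idea, as a minimal and purely illustrative sketch (this is not the Apertium implementation, and the names "votes" and "ambiguity_class" are hypothetical): an ambiguous word is disambiguated using only the ambiguity classes of the words in a small window around it.

<pre>
# Toy sketch of sliding-window PoS disambiguation (window of one
# word on each side). "votes" stands in for a trained table:
# (left class, right class, candidate tag) -> score.

def disambiguate(sentence, ambiguity_class, votes):
    tags = []
    for i, word in enumerate(sentence):
        candidates = ambiguity_class(word)  # set of possible tags
        if len(candidates) == 1:
            tags.append(next(iter(candidates)))
            continue
        left = (frozenset(ambiguity_class(sentence[i - 1]))
                if i > 0 else frozenset(["<s>"]))
        right = (frozenset(ambiguity_class(sentence[i + 1]))
                 if i + 1 < len(sentence) else frozenset(["</s>"]))
        # choose the candidate tag scoring highest in this window
        tags.append(max(candidates,
                        key=lambda t: votes.get((left, right, t), 0)))
    return tags

# toy demo: "can" is ambiguous; the DET...VBD window votes for NN
ambig = {"the": {"DET"}, "can": {"NN", "MD", "VB"}, "rusted": {"VBD"}}
votes = {(frozenset(["DET"]), frozenset(["VBD"]), "NN"): 1.0}
print disambiguate(["the", "can", "rusted"], lambda w: ambig[w], votes)
# -> ['DET', 'NN', 'VBD']
</pre>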
=== Proposal ===
My proposal is here: [http://wiki.apertium.org/wiki/User:Gang_Chen/GSoC_2013_Application:_%22Sliding_Window_PoS_Tagger%22 Proposal]

=== SVN ===
https://svn.code.sf.net/p/apertium/svn/branches/apertium-swpost
   
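A local copy of the branch can be checked out in the usual way (the directory name at the end is arbitrary):

<pre>
svn checkout https://svn.code.sf.net/p/apertium/svn/branches/apertium-swpost apertium-swpost
</pre>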
=== Progress ===
http://wiki.apertium.org/w/index.php?title=User:Gang_Chen/GSoC_2013_Progress

=== Summary ===
http://wiki.apertium.org/wiki/User:Gang_Chen/GSoC_2013_Summary

=== Documentation / Final Report ===
https://docs.google.com/file/d/0BxMmvpeK3ibWN0NjZmEtWnAxdDQ/edit
 
== Useful tools ==

=== Wikipedia extractor ===
http://wiki.apertium.org/wiki/User:Gang_Chen/Wikipedia_Extractor
==== purpose ====

This tool extracts the main text of Wikipedia articles, producing a plain-text corpus that is useful for part-of-speech tagging, language model training, etc.
 
 
==== tool ====
 
http://code.google.com/p/natural-language-qa/source/browse/MakeCorpus/WikiExtractor.py
 
 
==== usage ====
 
 
===== 1. Get the script =====
 
 
http://code.google.com/p/natural-language-qa/source/browse/MakeCorpus/WikiExtractor.py
 
 
===== 2. Download the Wikipedia dump file =====
 
 
http://dumps.wikimedia.org/backup-index.html
 
 
Taking Chinese as an example, download the file '''zhwiki-20130625-pages-articles.xml.bz2''' from http://dumps.wikimedia.org/zhwiki/20130625/. Alternatively, we can download the latest version from http://dumps.wikimedia.org/zhwiki/latest/.
 
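For instance, assuming the 20130625 dump is still available at the usual path (dump files follow the URL pattern shown below), it can be fetched from the command line:

<pre>
wget http://dumps.wikimedia.org/zhwiki/20130625/zhwiki-20130625-pages-articles.xml.bz2
</pre>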
 
===== 3. Use the script =====
 
<pre>
mkdir output
bzcat zhwiki-20130625-pages-articles.xml.bz2 | python WikiExtractor.py -o output
</pre>
 
 
Optionally, we can use

"-c" to compress the output files, saving disk space, and

"-b" to set the number of bytes per output file.

For more information, run "python WikiExtractor.py --help".
 
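For example, to write compressed files of about 500K each (500K is the script's default; the exact argument format for "-b" may vary with the script version):

<pre>
bzcat zhwiki-20130625-pages-articles.xml.bz2 | python WikiExtractor.py -c -b 500K -o output
</pre>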
 
OK, let's have a cup of tea and come back an hour later. The output will be under '''output/AA/wikiXX''', where the wikiXX files contain the extracted text.
 
 
 
===== 4. clean up "<>" tags =====
 
 
We are only one step away from the final text corpus, because the wikiXX files still contain "<>" tags (document markers and links). Let's use the following tiny script to filter them out.
 
 
<pre>
#!/usr/bin/python
# -*- coding: utf-8 -*-
import re
import sys

# matches any "<...>" tag, non-greedily
regex = re.compile(r"<.*?>")

line = sys.stdin.readline()
while line != "":
    # strip tags, then drop the trailing newline
    line = regex.sub("", line)[:-1]
    print line
    line = sys.stdin.readline()
</pre>
 
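Saving the script above as, say, clean_tags.py (the name is just an example), the whole extracted output can be cleaned in one pass:

<pre>
cat output/*/wiki* | python clean_tags.py > corpus.txt
</pre>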
 
===== 5. done :) =====
 
