User:Gang Chen

From Apertium

== About me ==

'''Name:''' Gang Chen

'''Email:''' pkuchengang@gmail.com

'''IRC:''' Gang

'''SourceForge:''' elephantgcc

'''GitHub Repo:''' https://github.com/elephantgcc

== GSoC 2013 ==

I'm working with Apertium for GSoC 2013, on the project "Sliding Window Part of Speech Tagger for Apertium".


=== Proposal ===
My proposal is here: [http://wiki.apertium.org/wiki/User:Gang_Chen/GSoC_2013_Application:_%22Sliding_Window_PoS_Tagger%22 Proposal]


=== SVN ===
https://svn.code.sf.net/p/apertium/svn/branches/apertium-swpost


=== Progress ===
http://wiki.apertium.org/w/index.php?title=User:Gang_Chen/GSoC_2013_Progress

=== Summary ===
http://wiki.apertium.org/wiki/User:Gang_Chen/GSoC_2013_Summary

=== Documentation / Final Report ===
https://docs.google.com/file/d/0BxMmvpeK3ibWN0NjZmEtWnAxdDQ/edit


== Useful tools ==

=== Wikipedia extractor ===


http://wiki.apertium.org/wiki/User:Gang_Chen/Wikipedia_Extractor
==== Purpose ====

This tool extracts the main text from a Wikipedia dump, producing a plain-text corpus, which is useful for part-of-speech tagging, language model training, etc.

==== Tool ====
http://code.google.com/p/natural-language-qa/source/browse/MakeCorpus/WikiExtractor.py

==== License ====
GPL v3

==== Usage ====

===== 1. Get the script =====

http://code.google.com/p/natural-language-qa/source/browse/MakeCorpus/WikiExtractor.py

===== 2. Download a Wikipedia dump file =====

http://dumps.wikimedia.org/backup-index.html

Taking Chinese as an example, download the file '''zhwiki-20130625-pages-articles.xml.bz2''' from http://dumps.wikimedia.org/zhwiki/20130625/. Alternatively, we can download the latest version from http://dumps.wikimedia.org/zhwiki/latest/.
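
For instance, the dump can be fetched from the command line (a minimal sketch; the exact file name changes with every dump date):

<pre>
# download the Chinese dump used in this example
wget http://dumps.wikimedia.org/zhwiki/20130625/zhwiki-20130625-pages-articles.xml.bz2
</pre>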

===== 3. Use the script =====
<pre>
mkdir output

bzcat zhwiki-20130625-pages-articles.xml.bz2 | python WikiExtractor.py -o output
</pre>

Optionally, we can use:

* "-c" to compress the output files, saving disk space, and
* "-b" to set the number of bytes per output file.

For more information, type "python WikiExtractor.py --help".
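
For example (a sketch only; the exact argument format for "-b" is documented in the script's --help output):

<pre>
# compressed output, split into files of roughly 500 KB each
bzcat zhwiki-20130625-pages-articles.xml.bz2 | python WikiExtractor.py -c -b 500K -o output
</pre>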

OK, let's have a cup of tea and come back an hour later. The output should be in '''output/AA/wikiXX''', where the wikiXX files contain the extracted text.


===== 4. Clean up "<>" tags =====

We are only one step away from the final text corpus, because the wikiXX files still contain some markup (such as links) wrapped in "<>" tags. Let's use the following tiny script to filter those tags out.

<pre>
#!/usr/bin/python
# -*- coding: utf-8 -*-
# Read text from stdin, remove anything wrapped in "<...>" tags, and print the result.
import sys
import re

regex = re.compile(r"<.*?>")  # non-greedy, so each tag is matched separately

line = sys.stdin.readline()
while line != "":
    line = regex.sub("", line).rstrip("\n")  # strip tags and the trailing newline
    print line
    line = sys.stdin.readline()
</pre>
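
For example, assuming the script above is saved as clean_tags.py (the file name is arbitrary) and made executable, the extracted files can be piped through it to build the final corpus:

<pre>
# concatenate the extracted files, strip the remaining tags, and save the corpus
cat output/AA/wiki* | ./clean_tags.py > corpus.txt
</pre>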

===== 5. Done :) =====
