Difference between revisions of "User:Gang Chen"
== About me ==

'''Name:''' Gang Chen

'''Email:''' pkuchengang@gmail.com

'''IRC:''' Gang

'''SourceForge:''' elephantgcc

'''GitHub Repo:''' https://github.com/elephantgcc

== GSOC 2013 ==

I'm working with Apertium for GSoC 2013, on the project "Sliding Window Part of Speech Tagger for Apertium".

=== Proposal ===

http://wiki.apertium.org/wiki/User:Gang_Chen/GSoC_2013_Application:_%22Sliding_Window_PoS_Tagger%22

=== SVN ===

https://svn.code.sf.net/p/apertium/svn/branches/apertium-swpost

=== Progress ===

http://wiki.apertium.org/w/index.php?title=User:Gang_Chen/GSoC_2013_Progress

=== Summary ===

http://wiki.apertium.org/wiki/User:Gang_Chen/GSoC_2013_Summary

=== Documentation / Final Report ===

https://docs.google.com/file/d/0BxMmvpeK3ibWN0NjZmEtWnAxdDQ/edit

== Useful tools ==

=== Wikipedia extractor ===

http://wiki.apertium.org/wiki/User:Gang_Chen/Wikipedia_Extractor

This tool extracts the main text from Wikipedia and produces a plain-text corpus, which is useful for part-of-speech tagging, language model training, etc.

http://code.google.com/p/natural-language-qa/source/browse/MakeCorpus/WikiExtractor.py

==== license ====

GPL-V3

==== usage ====

===== 1. Get the script =====

http://code.google.com/p/natural-language-qa/source/browse/MakeCorpus/WikiExtractor.py

===== 2. Download the Wikipedia dump file here =====

http://dumps.wikimedia.org/backup-index.html

Taking Chinese as an example, download the file '''zhwiki-20130625-pages-articles.xml.bz2''' from http://dumps.wikimedia.org/zhwiki/20130625/. Alternatively, download the latest version from http://dumps.wikimedia.org/zhwiki/latest/
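
The file name and the directory in the Chinese example follow a regular pattern, so the download URL can be assembled from a language code and a dump date. The following is a minimal sketch, assuming the layout seen for zhwiki/20130625 also holds for other languages and dates; <code>dump_url</code> is a hypothetical helper written for illustration and is not part of WikiExtractor.

```python
def dump_url(lang, date):
    """Build the URL of a pages-articles dump.

    Assumption: the layout seen for zhwiki/20130625 applies to other
    languages and dates. This helper is illustrative only.
    """
    wiki = "{}wiki".format(lang)
    filename = "{}-{}-pages-articles.xml.bz2".format(wiki, date)
    return "http://dumps.wikimedia.org/{}/{}/{}".format(wiki, date, filename)


print(dump_url("zh", "20130625"))
```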

===== 3. Use the script =====

<pre>
mkdir output
bzcat zhwiki-20130625-pages-articles.xml.bz2 | python WikiExtractor.py -o output
</pre>

Optionally, we can use

"-c" to compress the output files and save disk space, and

"-b" to set the number of bytes per output file.

For more information, run "python WikiExtractor.py --help".

OK, let's have a cup of tea and come back an hour later. The output should be '''output/AA/wikiXX''', where the wikiXX files contain the extracted text.
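
Once the extraction has finished, it can help to check what was actually written. The sketch below walks the output directory and lists the extracted files with their sizes. It assumes the layout described above (subdirectories such as AA holding files whose names start with "wiki"); <code>extracted_files</code> is a hypothetical helper, not something the extractor provides.

```python
import os


def extracted_files(output_dir):
    """Return the paths of all extracted files under output_dir, sorted.

    Assumption: files of interest have names starting with "wiki",
    as in output/AA/wikiXX.
    """
    paths = []
    for root, _dirs, files in os.walk(output_dir):
        for name in files:
            if name.startswith("wiki"):
                paths.append(os.path.join(root, name))
    return sorted(paths)


if __name__ == "__main__":
    # Print each extracted file together with its size in bytes.
    for path in extracted_files("output"):
        print(path, os.path.getsize(path))
```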

===== 4. Clean up "<>" tags =====

We are only one step away from the final text corpus, because the wikiXX files still contain links. Let's use the following tiny script to filter out the "<>" tags.

<pre>
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""Strip "<...>" tags from text read on stdin and write the result to stdout."""

import re
import sys

# Non-greedy match of everything between "<" and ">" on a single line.
TAG_RE = re.compile(r"<.*?>")

for line in sys.stdin:
    # Drop the trailing newline, remove the tags, and print the cleaned line.
    print(TAG_RE.sub("", line.rstrip("\n")))
</pre>
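
To see what this kind of filter does, it can be applied to a small sample. This is a sketch under assumptions: the extracted files are taken to wrap each article in <code>doc</code> tags and to keep links as <code>a</code> tags, and the sample text below is made up for illustration. <code>strip_tags</code> is a hypothetical name for the same regular-expression substitution.

```python
import re

# Non-greedy match of everything between "<" and ">" on a single line.
TAG_RE = re.compile(r"<.*?>")


def strip_tags(line):
    """Remove every "<...>" span from a single line of text."""
    return TAG_RE.sub("", line)


# Made-up sample in the assumed output format.
sample = [
    '<doc id="1" url="http://example.org/1" title="Example">',
    'Apertium is a <a href="Machine_translation">machine translation</a> platform.',
    "</doc>",
]
cleaned = [strip_tags(line) for line in sample]
# Lines that held only tags are now empty; drop them to keep just the text.
corpus = [line for line in cleaned if line.strip()]
print("\n".join(corpus))
```

Note that the pattern also removes any ordinary text that happens to sit between a literal "<" and a later ">" on the same line, so the result is worth spot-checking on text that contains such characters.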

===== 5. Done :) =====
Latest revision as of 10:51, 28 May 2018