User:Gang Chen

== About me ==

'''Name:''' Gang Chen

'''Email:''' pkuchengang@gmail.com

'''IRC:''' Gang

'''SourceForge:''' elephantgcc

'''GitHub:''' https://github.com/elephantgcc

== GSoC 2013 ==

I'm working with Apertium for GSoC 2013, on the project "Sliding Window Part of Speech Tagger for Apertium".


=== Proposal ===
My proposal is here: [http://wiki.apertium.org/wiki/User:Gang_Chen/GSoC_2013_Application:_%22Sliding_Window_PoS_Tagger%22 Proposal]


=== SVN ===
https://svn.code.sf.net/p/apertium/svn/branches/apertium-swpost


=== Progress ===
http://wiki.apertium.org/w/index.php?title=User:Gang_Chen/GSoC_2013_Progress

=== Summary ===
http://wiki.apertium.org/wiki/User:Gang_Chen/GSoC_2013_Summary

=== Documentation / Final Report ===
https://docs.google.com/file/d/0BxMmvpeK3ibWN0NjZmEtWnAxdDQ/edit


== Useful tools ==

=== Wikipedia extractor ===


http://wiki.apertium.org/wiki/User:Gang_Chen/Wikipedia_Extractor
==== Purpose ====

This tool extracts the main text from a Wikipedia dump, producing a plain-text corpus, which is useful for part-of-speech tagging, language model training, etc.

==== Tool ====
http://code.google.com/p/natural-language-qa/source/browse/MakeCorpus/WikiExtractor.py

==== Usage ====

1. Get the script.

http://code.google.com/p/natural-language-qa/source/browse/MakeCorpus/WikiExtractor.py

2. Download the Wikipedia dump file from http://dumps.wikimedia.org/backup-index.html

Let's take Chinese as an example: download the file '''zhwiki-20130625-pages-articles.xml.bz2''' from this page: http://dumps.wikimedia.org/zhwiki/20130625/

Alternatively, we can download the latest version from this page: http://dumps.wikimedia.org/zhwiki/latest/
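
For example, with wget (the exact file name under latest/ follows the usual dump naming convention, so treat it as an assumption):

<pre>
wget http://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2
</pre>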

3. Use the script.
<pre>
mkdir output
bzcat zhwiki-20130625-pages-articles.xml.bz2 | ./WikiExtractor.py -o output
</pre>

Optionally, we can use "-c" to compress the output files and save disk space, and "-b" to set the size of each output file. For more information, run "./WikiExtractor.py --help".
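
For instance, to write compressed output in chunks of roughly 500 KB each (the flag values here are only illustrative):

<pre>
bzcat zhwiki-20130625-pages-articles.xml.bz2 | ./WikiExtractor.py -c -b 500K -o output
</pre>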

OK, let's have a cup of tea and come back an hour later.

The output will appear as output/AA/wikiXX, where the wikiXX files contain the extracted text.
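
To sanity-check the result, we can peek at the beginning of the extracted files (the exact file names vary by WikiExtractor version):

<pre>
head -n 20 output/AA/wiki*
</pre>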

4. Clean up "<>" tags.

We are one step away from the final text corpus, because the wikiXX files still contain "<>" tags (e.g. the <doc> wrappers around each article). Let's use the following tiny script to filter them out.

<pre>
#!/usr/bin/python
# -*- coding:utf-8 -*-
import sys
import re

# match anything between a pair of angle brackets, non-greedily
regex = re.compile(r"<.*?>")

line = sys.stdin.readline()
while line != "":
    # remove the tags, and strip the trailing newline before printing
    line = regex.sub("", line)[:-1]
    print line
    line = sys.stdin.readline()
</pre>
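
To apply it to all the extracted files at once (remove_tags.py is just a placeholder name for the script above):

<pre>
cat output/*/wiki* | python remove_tags.py > corpus.txt
</pre>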

5. Done :)
