User:Gang Chen

From Apertium
 
== About me ==

'''Name:''' Gang Chen

'''Email:''' pkuchengang@gmail.com

'''IRC:''' Gang

'''SourceForge:''' elephantgcc

'''GitHub Repo:''' https://github.com/elephantgcc

== GSoC 2013 ==

I'm working with Apertium for GSoC 2013, on the project "Sliding Window Part of Speech Tagger for Apertium".

 
=== Proposal ===
http://wiki.apertium.org/wiki/User:Gang_Chen/GSoC_2013_Application:_%22Sliding_Window_PoS_Tagger%22

=== SVN ===
https://svn.code.sf.net/p/apertium/svn/branches/apertium-swpost

=== Progress ===
http://wiki.apertium.org/w/index.php?title=User:Gang_Chen/GSoC_2013_Progress

=== Summary ===
http://wiki.apertium.org/wiki/User:Gang_Chen/GSoC_2013_Summary

=== Documentation / Final Report ===
https://docs.google.com/file/d/0BxMmvpeK3ibWN0NjZmEtWnAxdDQ/edit

 
== Useful tools ==

=== Wikipedia extractor ===
http://wiki.apertium.org/wiki/User:Gang_Chen/Wikipedia_Extractor
==== tool ====
 
http://code.google.com/p/natural-language-qa/source/browse/MakeCorpus/WikiExtractor.py
 
 
==== target ====

This tool extracts the main text from Wikipedia dump files, producing a plain-text corpus useful for part-of-speech tagging, language-model training, etc.
 
 
==== usage ====

1. Get the script:

http://code.google.com/p/natural-language-qa/source/browse/MakeCorpus/WikiExtractor.py
 
 
2. Download the Wikipedia dump file from http://dumps.wikimedia.org/backup-index.html

Taking Chinese as an example, download the file '''zhwiki-20130625-pages-articles.xml.bz2''' from this page: http://dumps.wikimedia.org/zhwiki/20130625/
 
 
Alternatively, we can download the latest version from this page: http://dumps.wikimedia.org/zhwiki/latest/
 
 
3. Use the script.
 
<pre>
 
 
</pre>
 
 
4. Clean up the remaining "&lt;&gt;" tags.
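
Step 4 can be scripted. The sketch below is my own illustration, not part of the original toolchain; it assumes the extractor leaves `<doc ...>`-style wrapper tags in its output and simply strips everything of the form `<...>`:

```python
import re

def strip_tags(text):
    """Remove residual angle-bracket tags (e.g. <doc ...> / </doc> wrappers,
    an assumed output format) from the extracted corpus, keeping plain text."""
    cleaned = re.sub(r'<[^>]*>', '', text)          # drop anything of the form <...>
    cleaned = re.sub(r'\n\s*\n+', '\n\n', cleaned)  # collapse blank lines left behind
    return cleaned.strip()

sample = ('<doc id="12" url="http://zh.wikipedia.org/?curid=12">\n'
          '数学\n\n数学是研究数量、结构的学科。\n</doc>')
print(strip_tags(sample))
```

Note that this keeps only one-line tags; if the corpus is large, applying the same substitution line by line while streaming the file avoids loading it all into memory.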
 

Latest revision as of 10:51, 28 May 2018