Difference between revisions of "User:Wei2912"

From Apertium
m (formatting changes)
Line 1: Line 1:
My name is Wei En and I'm currently a GCI student. My blog is at [http://wei2912.github.io] and I have a site about *nix tutorials at [http://nixtuts.info].
My name is Wei En and I'm currently a GCI student. My blog is at http://wei2912.github.io.


I decided to help out at Apertium because I find the work here quite interesting and I believe Apertium will benefit many.
I decided to help out at Apertium because I find the work here quite interesting and I believe Apertium will benefit many.
Line 7: Line 7:
== Wiktionary Crawler ==
== Wiktionary Crawler ==


[https://github.com/wei2912/WiktionaryCrawler] is a crawler for Wiktionary which aims to extract data from pages. It was created for a GCI task which you can read about at [[Task ideas for Google Code-in/Scrape inflection information from Wiktionary]].
https://github.com/wei2912/WiktionaryCrawler is a crawler for Wiktionary which aims to extract data from pages. It was created for a GCI task which you can read about at [[Task ideas for Google Code-in/Scrape inflection information from Wiktionary]].


The crawler crawls a starting category (usually Category:XXX language)for subcategories, then crawls these subcategories for pages. It then passes the page to language-specific parsers which turn it into the [[Speling format]].
The crawler crawls a starting category (usually Category:XXX language)for subcategories, then crawls these subcategories for pages. It then passes the page to language-specific parsers which turn it into the [[Speling format]].
Line 15: Line 15:
== Spaceless Segmentation ==
== Spaceless Segmentation ==


Spaceless Segmentation has been merged into Apertium under [https://svn.code.sf.net/p/apertium/svn/branches/tokenisation]. It serves to tokenize languages without any whitespace. More information can be found under [[Task ideas for Google Code-in/Tokenisation for spaceless orthographies]].
Spaceless Segmentation has been merged into Apertium under https://svn.code.sf.net/p/apertium/svn/branches/tokenisation. It serves to tokenize languages without any whitespace. More information can be found under [[Task ideas for Google Code-in/Tokenisation for spaceless orthographies]].


A write-up on this tokeniser will be available quite soon.
A write-up on this tokeniser will be available quite soon.

Revision as of 07:35, 18 May 2014

My name is Wei En and I'm currently a Google Code-in (GCI) student. My blog is at http://wei2912.github.io.

I decided to help out at Apertium because I find the work here quite interesting and I believe Apertium will benefit many.

The following are projects related to Apertium.

Wiktionary Crawler

WiktionaryCrawler (https://github.com/wei2912/WiktionaryCrawler) is a crawler that extracts data from Wiktionary pages. It was created for a GCI task, which you can read about at Task ideas for Google Code-in/Scrape inflection information from Wiktionary.

The crawler crawls a starting category (usually Category:XXX language) for subcategories, then crawls these subcategories for pages. It then passes each page to a language-specific parser, which converts it into the Speling format.
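The flow described above can be sketched roughly as follows. This is only an illustration and not the actual WiktionaryCrawler code: the function names (`crawl_category`, `convert_pages`, `placeholder_parser`), the in-memory category data and the decision to follow nested subcategories are all assumptions made for the example. A real crawler would fetch category members from Wiktionary (for instance through the MediaWiki API) instead of reading dictionaries, and a real parser would emit Speling-format lines, whose layout is not reproduced here.

```python
from collections import deque


def crawl_category(start, get_subcats, get_pages):
    """Walk a category tree breadth-first and collect page titles.

    get_subcats and get_pages are callables supplied by the caller, so the
    sketch stays independent of how Wiktionary is actually queried.
    """
    seen = {start}
    queue = deque([start])
    pages = []
    while queue:
        category = queue.popleft()
        for title in get_pages(category):
            if title not in pages:  # a page may sit in several categories
                pages.append(title)
        for sub in get_subcats(category):
            if sub not in seen:  # guard against category cycles
                seen.add(sub)
                queue.append(sub)
    return pages


def convert_pages(titles, lang, parsers, get_text):
    """Hand every page to the parser registered for the given language code."""
    parser = parsers[lang]
    lines = []
    for title in titles:
        lines.extend(parser(title, get_text(title)))
    return lines


# Hypothetical in-memory stand-in for Wiktionary's category structure.
SUBCATS = {"Category:Lao language": ["Category:Lao nouns", "Category:Lao verbs"]}
PAGES = {
    "Category:Lao nouns": ["page A", "page B"],
    "Category:Lao verbs": ["page B", "page C"],
}


def placeholder_parser(title, text):
    # A real parser would extract inflection data; this one only reports size.
    return [f"{title}: {len(text)} characters"]


titles = crawl_category(
    "Category:Lao language",
    lambda c: SUBCATS.get(c, []),
    lambda c: PAGES.get(c, []),
)
lines = convert_pages(
    titles, "lo", {"lo": placeholder_parser}, lambda t: "dummy wikitext"
)
print(titles)
print(lines)
```

Keeping the parsers in a dictionary keyed by language code means that adding a language only requires registering one more function, which matches the idea of language-specific parsers.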

The languages currently supported are Chinese (zh), Thai (th) and Lao (lo). You are welcome to contribute to this project.

Spaceless Segmentation

Spaceless Segmentation has been merged into Apertium under https://svn.code.sf.net/p/apertium/svn/branches/tokenisation. It tokenises text in languages that are written without whitespace between words. More information can be found under Task ideas for Google Code-in/Tokenisation for spaceless orthographies.
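To give a feel for what tokenising text without whitespace involves, here is a minimal sketch of greedy longest-match segmentation against a word list. This is a generic textbook technique chosen for illustration; it is not necessarily the method used in the tokenisation branch, and the `segment` function and the tiny lexicon are hypothetical.

```python
def segment(text, lexicon):
    """Split text into tokens by always taking the longest lexicon match.

    Characters that start no known word are emitted as single-character
    tokens so that no input is lost.
    """
    max_len = max((len(word) for word in lexicon), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then progressively shorter ones.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon:
                break
        else:
            j = i + 1  # no match: fall back to a single character
        tokens.append(text[i:j])
        i = j
    return tokens


lexicon = {"中", "中国", "人民"}
print(segment("中国人民", lexicon))  # ['中国', '人民']
```

Greedy matching is simple but can pick the wrong split when a long word overlaps the start of the next one, which is one reason practical tokenisers often use more elaborate strategies.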

A write-up on this tokeniser will be available quite soon.