User:Wei2912
My name is Wei En and I am currently helping out as a Google Code-in mentor. I was a GCI student in 2013 and 2014, and helped out as a mentor at GCI 2015 and 2016. My blog is at http://wei2912.github.io.
Projects
Wiktionary Crawler
https://github.com/wei2912/WiktionaryCrawler is a crawler for Wiktionary which aims to extract data from pages. It was created for a GCI task which you can read about at Task ideas for Google Code-in/Scrape inflection information from Wiktionary.
The crawler crawls a starting category (usually Category:XXX language) for subcategories, then crawls these subcategories for pages. It then passes each page to a language-specific parser, which converts it into the Speling format.
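The traversal described above can be sketched roughly as follows. This is a minimal illustration, not the code from WiktionaryCrawler: the function `crawl_category`, the `get_members` callback and the toy category data are all hypothetical names introduced here, and the sketch assumes subcategories are followed breadth-first with duplicates skipped. In a real crawler, `get_members` would fetch the listing from the wiki instead of reading a dictionary.

```python
from collections import deque

def crawl_category(start, get_members):
    """Collect page titles reachable from a starting category.

    get_members(category) is a hypothetical callback that returns a
    (subcategories, pages) pair for one category.
    """
    seen_categories = {start}
    seen_pages = set()
    pages = []
    queue = deque([start])
    while queue:
        category = queue.popleft()
        subcategories, members = get_members(category)
        # Record each page once, in the order it is first encountered.
        for page in members:
            if page not in seen_pages:
                seen_pages.add(page)
                pages.append(page)
        # Queue unseen subcategories; the seen set guards against cycles.
        for sub in subcategories:
            if sub not in seen_categories:
                seen_categories.add(sub)
                queue.append(sub)
    return pages

# Toy data standing in for the wiki (assumed, for illustration only).
TOY_WIKI = {
    "Category:Thai language": (["Category:Thai nouns", "Category:Thai verbs"], []),
    "Category:Thai nouns": ([], ["page_a", "page_b"]),
    "Category:Thai verbs": (["Category:Thai nouns"], ["page_b", "page_c"]),
}

def toy_members(category):
    # Unknown categories are treated as empty.
    return TOY_WIKI.get(category, ([], []))

pages = crawl_category("Category:Thai language", toy_members)
```

Each title in `pages` would then be handed to the parser for its language.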
The languages currently supported are Chinese (zh), Thai (th) and Lao (lo).
Note: The project has been deprecated as a more modular web crawler has been built in GCI 2015.
Spaceless Segmentation
Spaceless Segmentation has been merged into Apertium under https://svn.code.sf.net/p/apertium/svn/branches/tokenisation. It tokenises text in languages that are written without whitespace between words. More information can be found under Task ideas for Google Code-in/Tokenisation for spaceless orthographies.
The tokeniser looks for possible tokenisations of the input using tokens found in the corpus text, and selects the tokenisation whose tokens appear most often in the corpus.
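As a rough illustration of this idea, the sketch below scores each candidate tokenisation by the total corpus frequency of its tokens and keeps the best one. It is not the code in the tokenisation branch: the function name `segment`, the choice of summing frequencies as the score, and the toy counts are assumptions made for this example.

```python
from collections import Counter
from functools import lru_cache

def segment(text, counts):
    """Split text into tokens that occur in counts, maximising the
    summed corpus frequency. Returns None if no full split exists.

    Assumption: "appears the most" is modelled as the sum of token counts.
    """
    max_len = max((len(token) for token in counts), default=0)

    @lru_cache(maxsize=None)
    def best(start):
        # Best (score, tokens) for text[start:], or None if unsplittable.
        if start == len(text):
            return (0, ())
        result = None
        last_end = min(len(text), start + max_len)
        for end in range(start + 1, last_end + 1):
            token = text[start:end]
            if token not in counts:
                continue
            rest = best(end)
            if rest is None:
                continue
            score = counts[token] + rest[0]
            if result is None or score > result[0]:
                result = (score, (token,) + rest[1])
        return result

    found = best(0)
    return list(found[1]) if found is not None else None

# Toy corpus frequencies (assumed, for illustration only).
toy_counts = Counter({"ab": 5, "a": 1, "b": 1, "c": 2, "bc": 1})
tokens = segment("abc", toy_counts)
```

With these toy counts, "abc" can be split as ab+c, a+b+c or a+bc, and the first wins because its tokens are the most frequent overall. The recursion is memoised per start position, so very long inputs would need an iterative version to stay within Python's recursion limit.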
Miscellaneous
Conversion of Sakha-English dictionary to lttoolbox format
In this example we're converting the following PDF file: http://home.uchicago.edu/straughn/sakhadic.pdf
We copy the text directly from the PDF file, as PDF to text converters are currently unable to convert the text properly (thanks to the arcane PDF format).
Then, we obtain the script for converting our dictionary:
$ svn co https://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/dixscrapers/
$ cd dixscrapers/
$ cat orig.txt | sakhadic2dix.py > sakhadic.xml
This will give us an XML dump of the dictionary, converted to the lttoolbox format. We sort and format the XML file as shown here to get the final dictionary:
$ apertium-dixtools sort sakhadic.xml sakhadic.dix
Our final dictionary is in sakhadic.dix.
For more details on sorting dictionaries, take a look at Sort a dictionary.