Difference between revisions of "User:Wei2912"


Revision as of 10:01, 5 December 2014

My name is Wei En and I'm currently a GCI student. My blog is at http://wei2912.github.io.

I decided to help out at Apertium because I find the work here quite interesting and I believe Apertium will benefit many people.

The following are projects related to Apertium.

Wiktionary Crawler

https://github.com/wei2912/WiktionaryCrawler is a crawler for Wiktionary which aims to extract data from its pages. It was created for a GCI task, which you can read about at Task ideas for Google Code-in/Scrape inflection information from Wiktionary.

The crawler crawls a starting category (usually Category:XXX language) for subcategories, then crawls these subcategories for pages. Each page is then passed to a language-specific parser, which turns it into the Speling format.
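
Roughly, the flow can be sketched like this (a simplified illustration only, not the actual WiktionaryCrawler code; the parser stub and the fields it emits are made-up placeholders):

<pre>
# Sketch of the crawl -> parse pipeline (illustrative, not the real WiktionaryCrawler).
# It walks a starting category, descends into subcategories and hands every page
# over to a language-specific parser that emits speling-format lines.
import requests

API = "https://en.wiktionary.org/w/api.php"

def category_members(category):
    """Yield (title, is_subcategory) for every member of a category."""
    params = {"action": "query", "list": "categorymembers",
              "cmtitle": category, "cmlimit": "500", "format": "json"}
    while True:
        data = requests.get(API, params=params).json()
        for member in data["query"]["categorymembers"]:
            yield member["title"], member["title"].startswith("Category:")
        if "continue" not in data:
            break
        params.update(data["continue"])  # follow the API's continuation marker

def crawl(category, parse_page, seen=None):
    """Recursively crawl a category tree and run the parser over each page."""
    seen = set() if seen is None else seen
    for title, is_subcat in category_members(category):
        if title in seen:
            continue
        seen.add(title)
        if is_subcat:
            yield from crawl(title, parse_page, seen)
        else:
            # The real crawler fetches and parses the page wikitext here.
            yield from parse_page(title)

def parse_zh(title):
    # Placeholder parser: the real ones emit proper speling-format lines.
    yield "%s; %s; placeholder-tags" % (title, title)

for line in crawl("Category:Chinese language", parse_zh):
    print(line)
</pre>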

The current languages supported are Chinese (zh), Thai (th) and Lao (lo). You are welcome to contribute to this project.

Spaceless Segmentation

Spaceless Segmentation has been merged into Apertium under https://svn.code.sf.net/p/apertium/svn/branches/tokenisation. It tokenises text in languages written without whitespace. More information can be found under Task ideas for Google Code-in/Tokenisation for spaceless orthographies.

The tokeniser enumerates the possible tokenisations of the input text and selects the one whose tokens appear most frequently in the corpus.
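
A minimal sketch of that idea, assuming we already have a frequency table of tokens built from the corpus (this is an illustration, not the code in the tokenisation branch):

<pre>
# Pick, among all segmentations into known tokens, the one whose tokens
# occur most often in the corpus (here: highest total frequency).
from functools import lru_cache

def segment(text, freq):
    """freq maps known tokens to their corpus counts."""
    @lru_cache(maxsize=None)
    def best(i):
        if i == len(text):
            return 0, ()
        candidates = []
        for j in range(i + 1, len(text) + 1):
            token = text[i:j]
            if token in freq:
                score, rest = best(j)
                if score is not None:
                    candidates.append((freq[token] + score, (token,) + rest))
        return max(candidates) if candidates else (None, ())
    score, tokens = best(0)
    return list(tokens)

# Toy frequency table standing in for corpus counts.
freq = {"他": 20, "说": 7, "话": 5, "说话": 15}
print(segment("他说话", freq))  # -> ['他', '说话']
</pre>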

A report comparing the above method, LRLM and RLLM (longest left-to-right matching and longest right-to-left matching, respectively) is available at https://www.dropbox.com/sh/57wtof3gbcbsl7c/AABI-Mcw2E-c942BXxsMbEAja
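
For comparison, LRLM is just a greedy pass that always takes the longest dictionary match from the current position, and RLLM is its mirror image scanning from the right. A simplified sketch of LRLM:

<pre>
# Greedy longest left-to-right matching (LRLM), simplified.
def lrlm(text, lexicon, max_len=4):
    tokens, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon:      # take the longest known token here
                tokens.append(text[i:j])
                i = j
                break
        else:                             # no match: emit a single character
            tokens.append(text[i])
            i += 1
    return tokens

print(lrlm("他说话", {"他", "说", "话", "说话"}))  # -> ['他', '说话']
</pre>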

Conversion of Sakha-English dictionary to lttoolbox format

In this example we're converting the following PDF file: http://home.uchicago.edu/straughn/sakhadic.pdf

We copy the text directly from the PDF file, as PDF-to-text converters are currently unable to extract the text properly (thanks to the arcane PDF format).

Then, we obtain the script for converting our dictionary:

<pre>
$ svn co https://svn.code.sf.net/p/apertium/svn/trunk/dixscrapers/
$ cat orig.txt | dixscrapers/sakhadic2dix.py > sakhadic.xml
</pre>
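
To give an idea of what the script does, each dictionary line such as "аа exc. Oh! See!" is split into headword, part of speech and gloss, and wrapped in an lttoolbox entry. A simplified sketch (the line pattern and tag mapping below are assumptions, not the actual sakhadic2dix.py logic):

<pre>
# Simplified line-to-entry conversion (illustrative, not the real sakhadic2dix.py).
import re
import sys
from xml.sax.saxutils import escape

# Hypothetical mapping from the dictionary's POS labels to lttoolbox symbols.
POS_TAGS = {"exc.": "ij", "n.": "n", "v.": "v"}

def line_to_entry(line):
    # Assumed line shape: "headword pos. gloss", e.g. "аа exc. Oh! See!"
    m = re.match(r"(\S+)\s+(\S+\.)\s+(.+)", line.strip())
    if not m:
        return None
    head, pos, gloss = m.groups()
    tag = POS_TAGS.get(pos)
    if tag is None:
        return None
    gloss = escape(gloss).replace(" ", "<b/>")  # spaces become <b/> in the dix
    return ('<e><p><l>%s<s n="%s"/></l>'
            '<r>%s<s n="%s"/></r></p></e>' % (escape(head), tag, gloss, tag))

for line in sys.stdin:
    entry = line_to_entry(line)
    if entry:
        print(entry)
</pre>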

Running the script gives us an XML dump of the dictionary, converted to the lttoolbox format. We sort and format the XML file as shown here to get the final dictionary:

<pre>
$ apertium-dixtools sort sakhadic.xml sakhadic.dix
</pre>
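
The final file format should look like this:

<pre>
<?xml version="1.0" encoding="utf-8"?>
<dictionary>
<section id="main" type="standard">
<e>
<!--аа exc. Oh! See!-->
<p>
<l>аа<s n="ij"/></l>
<r>Oh!<b/>See!<s n="ij"/></r>
</p>
</e>
...
</section>
</dictionary>
</pre>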