Difference between revisions of "User:Wei2912"
Revision as of 06:31, 2 December 2014
My name is Wei En and I'm currently a GCI student. My blog is at http://wei2912.github.io.
I decided to help out at Apertium because I find the work here quite interesting and I believe Apertium will benefit many.
The following are projects related to Apertium.
Wiktionary Crawler
https://github.com/wei2912/WiktionaryCrawler is a crawler for Wiktionary which aims to extract data from pages. It was created for a GCI task which you can read about at Task ideas for Google Code-in/Scrape inflection information from Wiktionary.
The crawler crawls a starting category (usually Category:XXX language) for subcategories, then crawls these subcategories for pages. It then passes each page to a language-specific parser, which converts it into the Speling format.
The currently supported languages are Chinese (zh), Thai (th) and Lao (lo). You are welcome to contribute to this project.
Spaceless Segmentation
Spaceless Segmentation has been merged into Apertium under https://svn.code.sf.net/p/apertium/svn/branches/tokenisation. It tokenises text in languages that are written without whitespace between words. More information can be found under Task ideas for Google Code-in/Tokenisation for spaceless orthographies.
The tokeniser looks for possible tokenisations in the corpus text and selects the tokenisation whose tokens appear most often in the corpus.
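The idea of picking the tokenisation whose tokens are most frequent could be sketched as below. This is a minimal illustration, not the code merged into the tokenisation branch: the function name `best_segmentation`, the `counts` table and the scoring rule (sum of log relative frequencies, i.e. a unigram model) are all assumptions made for the example.

```python
import math
from collections import Counter
from functools import lru_cache


def best_segmentation(text, counts):
    """Split `text` into tokens found in `counts`, preferring frequent tokens.

    Assumption: a segmentation is scored by the sum of the log relative
    frequencies of its tokens. Returns None if no segmentation exists.
    """
    total = sum(counts.values())
    max_len = max(map(len, counts), default=0)

    @lru_cache(maxsize=None)
    def solve(start):
        # Best (score, tokens) for text[start:], or None if it cannot be split.
        if start == len(text):
            return (0.0, ())
        best = None
        for end in range(start + 1, min(len(text), start + max_len) + 1):
            token = text[start:end]
            if token not in counts:
                continue
            rest = solve(end)
            if rest is None:
                continue
            score = math.log(counts[token] / total) + rest[0]
            if best is None or score > best[0]:
                best = (score, (token,) + rest[1])
        return best

    result = solve(0)
    return list(result[1]) if result is not None else None


# Hypothetical token counts, standing in for counts gathered from a corpus.
counts = Counter({"ab": 5, "c": 5, "a": 1, "bc": 1})
print(best_segmentation("abc", counts))
```

The recursion is memoised per start position, so each suffix is solved once. For long inputs an iterative version would avoid Python's recursion limit.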
A report comparing the above method, LRLM and RLLM (longest left to right matching and longest right to left matching respectively) is available at https://www.dropbox.com/sh/57wtof3gbcbsl7c/AABI-Mcw2E-c942BXxsMbEAja
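For reference, the two baselines named above can be sketched as greedy longest-match loops. The function names `lrlm` and `rllm`, the `vocab` set and the fallback of emitting an unknown character as its own token are assumptions made for this illustration. They are not taken from the report.

```python
def lrlm(text, vocab):
    """Left-to-right longest matching: scan from the start, always taking
    the longest vocabulary entry that begins at the current position."""
    tokens = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):  # longest candidate first
            if text[i:j] in vocab:
                break
        else:
            j = i + 1  # assumption: unknown character becomes its own token
        tokens.append(text[i:j])
        i = j
    return tokens


def rllm(text, vocab):
    """Right-to-left longest matching: scan from the end, always taking
    the longest vocabulary entry that ends at the current position."""
    tokens = []
    j = len(text)
    while j > 0:
        for i in range(0, j):  # longest candidate first
            if text[i:j] in vocab:
                break
        else:
            i = j - 1  # assumption: unknown character becomes its own token
        tokens.append(text[i:j])
        j = i
    tokens.reverse()
    return tokens


# Hypothetical vocabulary on which the two directions disagree.
vocab = {"a", "abc", "bcd", "d"}
print(lrlm("abcd", vocab))
print(rllm("abcd", vocab))
```

On this vocabulary the left-to-right pass commits to "abc" first, while the right-to-left pass commits to "bcd" first, which is why the two methods can produce different tokenisations of the same string.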
Conversion of PDF dictionary to lttoolbox format
NOTE: This document is a draft.
In this example we're converting the following PDF file: http://home.uchicago.edu/straughn/sakhadic.pdf
The following preprocessing is done (partly with sed, partly by hand):

1. The PDF is converted to text.

2. Blank lines, bullet points and page numbers are removed.

3. Sections such as the introduction and bibliography are removed.

4. Unneeded equals signs are removed.
The process may vary for other dictionaries.
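Steps 2 and 4 were done with sed, but the same clean-up could be sketched in Python as below. Everything here is an assumption for illustration: the function name `clean_lines`, the set of bullet glyphs, treating any digits-only line as a page number, and dropping every equals sign. Step 1 (PDF to text) and step 3 (removing whole sections) are assumed to have been done beforehand.

```python
BULLET_CHARS = "•·▪◦"  # assumed bullet glyphs; adjust to the PDF at hand


def clean_lines(lines):
    """Drop blank lines, bullets, page numbers and equals signs."""
    cleaned = []
    for raw in lines:
        line = raw.replace("=", "")  # step 4: remove equals signs
        line = line.strip().lstrip(BULLET_CHARS).strip()  # step 2: bullets
        if not line or line.isdigit():  # step 2: blank lines, page numbers
            continue
        cleaned.append(line)
    return cleaned


sample = ["аа", "", "• ", "12", "= exc.", "Oh! See!"]
print(clean_lines(sample))
```

A digits-only test for page numbers would also drop a legitimate line consisting only of a number, so the result still needs a manual check.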
Once this is done, we obtain a dictionary file that looks like this:
<pre>
аа
exc.
Oh! See!
аа
ҕ
ыс
v.
to reckon with
аайы
a.
eac
h, every;
к
ү
н
аайы
every day
аак
cf
аах
n.
document, paper;
аах
v.
to read
аал
n.
ship, barge, float, buoy
аал
v.
to rub
аалыс
v.
to socialize, mingle with
аан
n.
door, entrance;
ааннаа
v.
to provide with a door;
олбуор
аана
n.
gate
аар
-
маар
a.
stupid
</pre>
The main problem with PDF-to-text conversion is that many words end up with newlines in the middle of them. This is due to a limitation of the PDF converter.
Fortunately for us, we can see that the format is somewhat regular:
<pre>
partofword
partofword
...
partofword
abbreviation ==> ends with a full stop
definition
definition
...
definition
</pre>
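One way to exploit this regularity is sketched below. It is a rough heuristic under stated assumptions, not a finished converter: the name `parse_entries`, the `ABBREVIATIONS` set (taken from the sample above) and the rule that a headword is the run of Cyrillic-only lines directly before an abbreviation are all hypothetical choices.

```python
import re

ABBREVIATIONS = {"exc.", "v.", "n.", "a."}  # assumed; taken from the sample
CYRILLIC_ONLY = re.compile(r"[\u0400-\u04FF\s\-]+")  # Cyrillic, spaces, hyphens


def parse_entries(lines):
    """Group cleaned lines into {"word", "pos", "definition"} entries."""
    abbr_idx = [i for i, line in enumerate(lines) if line in ABBREVIATIONS]

    # For each abbreviation, walk backwards over Cyrillic-only lines to find
    # where its headword starts, never crossing the previous abbreviation.
    starts = []
    for k, i in enumerate(abbr_idx):
        lower = abbr_idx[k - 1] + 1 if k else 0
        start = i
        while start > lower and CYRILLIC_ONLY.fullmatch(lines[start - 1]):
            start -= 1
        starts.append(start)

    entries = []
    for k, i in enumerate(abbr_idx):
        end = starts[k + 1] if k + 1 < len(abbr_idx) else len(lines)
        entries.append({
            "word": "".join(lines[starts[k]:i]),      # rejoin the split headword
            "pos": lines[i],
            "definition": " ".join(lines[i + 1:end]),
        })
    return entries


sample = ["аа", "exc.", "Oh! See!", "аа", "ҕ", "ыс", "v.", "to reckon with",
          "аар", "-", "маар", "a.", "stupid"]
for entry in parse_entries(sample):
    print(entry)
```

The heuristic has known gaps. Words broken inside a definition (such as "eac" / "h, every;") are joined with a space rather than repaired, Cyrillic text at the end of a definition is absorbed into the next headword, and lines such as "cf", which lack a full stop, are treated as definition text.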