Difference between revisions of "User:Wei2912"
Latest revision as of 08:13, 29 May 2021
My name is Ng Wei En and I am helping out Apertium as a Google Code-in (GCI) mentor. I was a GCI student in 2013 and 2014, and helped out at the GCIs in 2015, 2016 and 2017. I have a general interest in mathematics and computer science, particularly algorithms and cryptography.
Blog: https://wei2912.github.io
GitHub: https://github.com/wei2912
Twitter: https://twitter.com/wei2912
Projects
Wiktionary Crawler
https://github.com/wei2912/WiktionaryCrawler is a crawler for Wiktionary which aims to extract data from pages. It was created for a GCI task which you can read about at Task ideas for Google Code-in/Scrape inflection information from Wiktionary.
The crawler crawls a starting category (usually Category:XXX language) for subcategories, then crawls these subcategories for pages. It then passes each page to a language-specific parser, which turns it into the Speling format.
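The traversal described above can be sketched as a breadth-first walk over categories. This is only an illustrative sketch and not the code of WiktionaryCrawler: `crawl_category`, `list_members` and `FAKE_SITE` are hypothetical names, the category listing is an in-memory stand-in instead of real Wiktionary requests, and the sketch assumes that subcategories may be nested and may link back to each other.

```python
from collections import deque
from typing import Callable, Dict, List, Tuple

# (subcategories, pages) contained in one category.
Listing = Tuple[List[str], List[str]]

# Hypothetical stand-in for Wiktionary, so the sketch runs without network access.
FAKE_SITE: Dict[str, Listing] = {
    "Category:Thai language": (["Category:Thai nouns", "Category:Thai verbs"], []),
    "Category:Thai nouns": ([], ["noun-page-1", "noun-page-2"]),
    "Category:Thai verbs": (["Category:Thai nouns"], ["verb-page-1"]),
}


def crawl_category(start: str, list_members: Callable[[str], Listing]) -> List[str]:
    """Collect the page titles reachable from `start` through its subcategories."""
    seen_categories = {start}
    seen_pages = set()
    pages: List[str] = []
    queue = deque([start])
    while queue:
        category = queue.popleft()
        subcategories, members = list_members(category)
        for page in members:
            if page not in seen_pages:  # a page can sit in several categories
                seen_pages.add(page)
                pages.append(page)
        for sub in subcategories:
            if sub not in seen_categories:  # guard against category cycles
                seen_categories.add(sub)
                queue.append(sub)
    return pages


pages = crawl_category(
    "Category:Thai language",
    lambda category: FAKE_SITE.get(category, ([], [])),
)
```

Each title in `pages` would then be handed to the parser for the relevant language; keeping the listing function as a parameter makes it possible to test the traversal without touching the network.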
The currently supported languages are Chinese (zh), Thai (th) and Lao (lo).
Note: The project has been deprecated, as a more modular web crawler was built during GCI 2015.
Spaceless Segmentation
Spaceless Segmentation has been merged into Apertium under https://svn.code.sf.net/p/apertium/svn/branches/tokenisation. It serves to tokenise text in languages that are written without whitespace between words. More information can be found under Task ideas for Google Code-in/Tokenisation for spaceless orthographies.
The tokeniser looks for possible tokenisations in the corpus text and selects the tokenisation whose tokens appear most often in the corpus.
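A minimal sketch of this idea follows. It is not the code from the tokenisation branch: the function names are hypothetical, and scoring a candidate by the sum of its tokens' corpus counts is an assumption, since the exact scoring rule is not stated here.

```python
from collections import Counter
from typing import Container, List


def tokenisations(text: str, vocabulary: Container[str]) -> List[List[str]]:
    """Return every way of splitting `text` into tokens found in `vocabulary`."""
    if not text:
        return [[]]  # one way to tokenise the empty string: no tokens
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in vocabulary:
            for rest in tokenisations(text[end:], vocabulary):
                results.append([head] + rest)
    return results


def best_tokenisation(text: str, counts: Counter) -> List[str]:
    """Pick the candidate whose tokens have the highest total corpus count."""
    candidates = tokenisations(text, counts)
    if not candidates:
        return [text]  # nothing matched: leave the text unsegmented
    # Assumption: a tokenisation is scored by summing its tokens' counts.
    return max(candidates, key=lambda tokens: sum(counts[t] for t in tokens))


# Toy corpus counts, invented for illustration only.
counts = Counter({"ab": 5, "a": 1, "b": 1, "c": 2, "abc": 1})
best = best_tokenisation("abc", counts)
```

Enumerating every candidate is exponential in the worst case, so this only suits short inputs; a dynamic-programming pass over string positions would be the usual replacement for longer text.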
Miscellaneous
Conversion of Sakha-English dictionary to lttoolbox format
In this example we're converting the following PDF file: http://home.uchicago.edu/straughn/sakhadic.pdf
We copy the text directly from the PDF file, as PDF to text converters are currently unable to convert the text properly (thanks to the arcane PDF format).
Then, we obtain the script for converting our dictionary:
$ svn co https://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/dixscrapers/
$ cd dixscrapers/
$ cat orig.txt | sakhadic2dix.py > sakhadic.xml
This will give us an XML dump of the dictionary, converted to the lttoolbox format. We sort and format the XML file as shown here to get the final dictionary:
$ apertium-dixtools sort sakhadic.xml sakhadic.dix
Our final dictionary is in sakhadic.dix.
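To give an idea of what an entry in the lttoolbox format looks like, the sketch below builds a single bilingual entry with Python's standard library. It is not taken from sakhadic2dix.py: `make_entry` is a hypothetical helper, and the headword, gloss and part-of-speech tag are illustrative values.

```python
import xml.etree.ElementTree as ET


def make_entry(headword: str, gloss: str, pos: str) -> ET.Element:
    """Build <e><p><l>headword<s n=pos/></l><r>gloss<s n=pos/></r></p></e>."""
    entry = ET.Element("e")
    pair = ET.SubElement(entry, "p")
    for tag, text in (("l", headword), ("r", gloss)):
        side = ET.SubElement(pair, tag)
        side.text = text
        ET.SubElement(side, "s", {"n": pos})  # part-of-speech symbol
    return entry


# Illustrative values: a Sakha noun and a one-word English gloss.
entry = make_entry("аал", "ship", "n")
xml_text = ET.tostring(entry, encoding="unicode")
```

Only single-word glosses are handled here; in lttoolbox dictionaries, spaces inside a multiword entry are written as `<b/>` elements, which this sketch does not attempt.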
For more details on sorting dictionaries, take a look at Sort a dictionary.