User:Wei2912

My name is Ng Wei En and I am helping out Apertium as a Google Code-in mentor. I was a GCI student in 2013 and 2014, and helped out at the GCIs in 2015, 2016 and 2017. I have a general interest in mathematics and computer science, particularly algorithms and cryptography.

'''Blog''': https://wei2912.github.io

'''GitHub''': https://github.com/wei2912

'''Twitter''': https://twitter.com/wei2912

== Projects ==

=== Wiktionary Crawler ===
   
 
https://github.com/wei2912/WiktionaryCrawler is a crawler for Wiktionary which aims to extract data from its pages. It was created for a GCI task, which you can read about at [[Task ideas for Google Code-in/Scrape inflection information from Wiktionary]].

The crawler crawls a starting category (usually Category:XXX language) for subcategories, then crawls these subcategories for pages. It then passes each page to a language-specific parser, which turns it into the [[Speling format]]. A minimal sketch of the crawling step is given below.
   
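For illustration, here is a minimal sketch of that crawling step in Python, using the MediaWiki API. This is a hypothetical reimplementation, not the actual WiktionaryCrawler code; the starting category and the final <code>print</code> stand in for the real language-specific parsing:

<pre>
import requests

API_URL = "https://en.wiktionary.org/w/api.php"

def category_members(category, member_type):
    """Yield titles of a category's members ("subcat" or "page")."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmtype": member_type,
        "cmlimit": "500",
        "format": "json",
    }
    while True:
        data = requests.get(API_URL, params=params).json()
        for member in data["query"]["categorymembers"]:
            yield member["title"]
        if "continue" not in data:
            break
        params.update(data["continue"])  # resume where the last batch ended

def crawl(start_category):
    # Crawl the starting category for subcategories...
    for subcat in category_members(start_category, "subcat"):
        # ...then crawl each subcategory for pages.
        for page in category_members(subcat, "page"):
            print(page)  # a language-specific parser would take over here

crawl("Category:Thai language")
</pre>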
The languages currently supported are Chinese (zh), Thai (th) and Lao (lo).

'''Note: This project has been deprecated, as a more modular web crawler was built in GCI 2015.'''
=== Spaceless Segmentation ===

Spaceless Segmentation has been merged into Apertium under https://svn.code.sf.net/p/apertium/svn/branches/tokenisation. It serves to tokenise languages written without any whitespace. More information can be found under [[Task ideas for Google Code-in/Tokenisation for spaceless orthographies]].

The tokeniser enumerates the possible tokenisations of the input text and selects the one whose tokens appear most frequently in the corpus. A toy sketch of this selection is given below.
   
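The following Python sketch illustrates the idea. It is a hypothetical illustration, not the code in the tokenisation branch, and the frequency table is invented:

<pre>
def segmentations(text, lexicon):
    """Enumerate every way of splitting text into tokens from the lexicon."""
    if not text:
        yield []
        return
    for i in range(1, len(text) + 1):
        head = text[:i]
        if head in lexicon:
            for tail in segmentations(text[i:], lexicon):
                yield [head] + tail

def best_tokenisation(text, freqs):
    """Pick the tokenisation whose tokens occur most often in the corpus."""
    candidates = list(segmentations(text, freqs))
    if not candidates:
        return [text]  # no known split: leave the string as one token
    return max(candidates, key=lambda tokens: sum(freqs[t] for t in tokens))

# Invented corpus counts for the Thai string "กรุงเทพ" (Bangkok):
freqs = {"กรุง": 40, "เทพ": 55, "กร": 3, "ุง": 1}
print(best_tokenisation("กรุงเทพ", freqs))  # ['กรุง', 'เทพ']
</pre>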
== Miscellaneous ==

=== Conversion of Sakha-English dictionary to lttoolbox format ===
 
In this example we're converting the following PDF file: http://home.uchicago.edu/straughn/sakhadic.pdf

We copy the text directly from the PDF file and save it as <code>orig.txt</code>, as PDF-to-text converters are currently unable to convert the text properly (thanks to the arcane PDF format). A sample of the copied text is shown below.
   
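For illustration, entries in the copied text look like this; words and definitions are separated by commas or semicolons, and each part-of-speech abbreviation ends with a full stop:

<pre>
аа exc. Oh! See!
ааҕыс v. to reckon with
аайы a. each, every
күн аайы every day
...
</pre>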
Then, we obtain the script for converting our dictionary:
 
   
 
<pre>
$ svn co https://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/dixscrapers/
$ cd dixscrapers/
$ cat orig.txt | ./sakhadic2dix.py > sakhadic.xml
</pre>
   
This will give us an XML dump of the dictionary, converted to the lttoolbox format. We sort and format the XML file as shown here to get the final dictionary:
<pre>
$ apertium-dixtools sort sakhadic.xml sakhadic.dix
</pre>
   
Our final dictionary is in <code>sakhadic.dix</code>. A sample of the converted entries is shown below.
 
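For illustration, the beginning of the converted dictionary looks like this (a pretty-printed sample of the conversion output; sorting does not change the entry structure). Each source line is kept as an XML comment above its entry:

<pre>
<?xml version="1.0" encoding="utf8"?>
<dictionary>
  <pardefs>
    <!--аа exc. Oh! See!-->
    <e r="LR">
      <p>
        <l>аа<s n="ij"/></l>
        <r>Oh!<b/>See!<s n="ij"/></r>
      </p>
    </e>
    <!--ааҕыс v. to reckon with-->
    <e r="LR">
      <p>
        <l>ааҕыс<s n="v"/><s n="TD"/></l>
        <r>reckon<b/>with<s n="v"/><s n="TD"/></r>
      </p>
    </e>
    ...
  </pardefs>
</dictionary>
</pre>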
   
For more details on sorting dictionaries, take a look at [[Sort a dictionary]].
 
 
 
