Difference between revisions of "User:Wei2912"

From Apertium
(→‎Conversion of PDF dictionary to lttoolbox format: Finished description on how to convert)

Revision as of 16:41, 2 December 2014

My name is Wei En and I'm currently a GCI student. My blog is at http://wei2912.github.io.

I decided to help out at Apertium because I find the work here quite interesting and I believe Apertium will benefit many.

The following are projects related to Apertium.

Wiktionary Crawler

https://github.com/wei2912/WiktionaryCrawler is a crawler for Wiktionary which aims to extract data from pages. It was created for a GCI task which you can read about at Task ideas for Google Code-in/Scrape inflection information from Wiktionary.

The crawler crawls a starting category (usually Category:XXX language) for subcategories, then crawls these subcategories for pages. It then passes each page to language-specific parsers which turn it into the Speling format.

The current languages supported are Chinese (zh), Thai (th) and Lao (lo). You are welcome to contribute to this project.

Spaceless Segmentation

Spaceless Segmentation has been merged into Apertium under https://svn.code.sf.net/p/apertium/svn/branches/tokenisation. It serves to tokenize languages without any whitespace. More information can be found under Task ideas for Google Code-in/Tokenisation for spaceless orthographies.

The tokeniser looks for possible tokenisations of the input using the corpus text and selects the tokenisation whose tokens appear most often in the corpus.
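
As a rough illustration of the idea, here is a minimal Python sketch. It is not the code in the tokenisation branch: the function names are hypothetical, and the scoring rule (sum of the corpus counts of the tokens) is an assumption, since the description above does not pin down how candidates are ranked.

```python
from collections import Counter


def segmentations(text, vocab):
    """Yield every way of splitting text into tokens that all occur in vocab."""
    if not text:
        yield []
        return
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in vocab:
            for rest in segmentations(text[end:], vocab):
                yield [head] + rest


def best_segmentation(text, counts):
    """Return the candidate with the highest total corpus count.

    Summing the counts is an assumption made for this sketch.
    Returns None when no segmentation is possible.
    """
    candidates = list(segmentations(text, counts))
    if not candidates:
        return None
    return max(candidates, key=lambda tokens: sum(counts[t] for t in tokens))


# Toy corpus counts, invented for the example.
counts = Counter({"ab": 5, "a": 1, "b": 1, "c": 3, "bc": 2})
print(best_segmentation("abc", counts))  # ['ab', 'c']
```

The recursion enumerates every candidate, which is exponential in the worst case; a real implementation would probably use dynamic programming.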

A report comparing the above method, LRLM and RLLM (longest left to right matching and longest right to left matching respectively) is available at https://www.dropbox.com/sh/57wtof3gbcbsl7c/AABI-Mcw2E-c942BXxsMbEAja
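
For reference, the two baseline methods can be sketched in Python as follows. This is a generic greedy longest-match sketch, not the code used for the report; the function names and the single-character fallback for unknown text are assumptions.

```python
def lrlm(text, vocab):
    """Greedy longest match, scanning from left to right."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first; fall back to one character.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens


def rllm(text, vocab):
    """Greedy longest match, scanning from right to left."""
    tokens, j = [], len(text)
    while j > 0:
        # Try the longest candidate ending at j first; fall back to one character.
        for i in range(0, j):
            if text[i:j] in vocab or i == j - 1:
                tokens.append(text[i:j])
                j = i
                break
    return tokens[::-1]


# Toy vocabulary, invented for the example.
vocab = {"a", "ab", "abc", "bcd", "cd", "d"}
print(lrlm("abcd", vocab))  # ['abc', 'd']
print(rllm("abcd", vocab))  # ['a', 'bcd']
```

The example input is chosen so that the two directions disagree, which is exactly the kind of case such a comparison has to deal with.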

Conversion of PDF dictionary to lttoolbox format

NOTE: This document is a draft.

In this example we're converting the following PDF file: http://home.uchicago.edu/straughn/sakhadic.pdf

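We copy the text directly from the PDF file, as PDF-to-text converters are currently unable to convert the text properly (thanks to the arcane PDF format).

The preprocessing removes the bullets, lines that contain only a number, and equals signs, puts everything that follows a "; " on a new line, and changes "cf." to "cf". All of this is contained in the following script, to which we supply a filename: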

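#!/bin/bash
# Reads the file named in $1 and writes the preprocessed text to $1.new.
# -p already loops over the input lines and prints them, so -n and cat are not needed.
perl -wpe 's/•//g; s/^\d+$//g; s/=//g; s/\; /\n/g; s/cf\./cf/;' "$1" > "$1.new"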
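
For readers who prefer Python, the same substitutions could be written roughly as below. This is an assumed equivalent for illustration only, not part of the original workflow, and `preprocess_line` is a hypothetical name.

```python
import re


def preprocess_line(line):
    """Apply the same substitutions as the Perl one-liner, in the same order."""
    line = line.replace("•", "")        # s/•//g
    line = re.sub(r"^\d+$", "", line)   # s/^\d+$//g  (lines that are only a number)
    line = line.replace("=", "")        # s/=//g
    line = line.replace("; ", "\n")     # s/\; /\n/g
    line = line.replace("cf.", "cf", 1)  # s/cf\./cf/  (first occurrence only)
    return line


sample = "аак cf. аах n. document, paper; аах= v. to read\n"
print(preprocess_line(sample), end="")
```

On the sample line this prints "аак cf аах n. document, paper" and "аах v. to read" on two separate lines.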

After the preprocessing, we get the following file:

... blank lines omitted ...
аа exc. Oh! See!
ааҕыс v. to reckon with
аайы a. each, every
күн аайы every day
...

The blank lines weren't removed so that you can tell where a page starts and ends, and hence coordinate the manual processing with the dictionary.

Our preprocessor replaces "; " with "\n" in order to get a list of words separated by newlines. Unfortunately for us, definitions may be separated by "; " too, or may spill over onto the next line. Hence, we'll need to merge these lines together to get the same format as the dictionary.

Some words have different word forms. To handle this, we copy over the original word to create a new entry. This:

албас a. cunning; n. trick, ruse

becomes

албас a. cunning
албас n. trick, ruse

The good part about this is that the word forms are also separated by "; " and will be placed on a new line after the preprocessing, so it's easy to spot the lines where we need to handle this.
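
This step is described here as manual work, but part of it could be automated. The sketch below is only an assumption about how that might look: it treats a line that starts with a known abbreviation as a word form missing its headword, and copies the headword from the previous entry. The names `ABBREVIATIONS`, `headword` and `copy_headwords` are hypothetical, and the output would still need to be checked by hand.

```python
# Abbreviations that mark the start of a word form (taken from the dictionary's list).
ABBREVIATIONS = {"a.", "adv.", "conv.", "det.", "exc.", "int.",
                 "n.", "num.", "pl.", "pp.", "pro.", "v."}


def headword(line):
    """Return everything before the first abbreviation on the line."""
    words = []
    for token in line.split():
        if token in ABBREVIATIONS:
            break
        words.append(token)
    return " ".join(words)


def copy_headwords(lines):
    """Prefix lines that start with an abbreviation with the previous headword."""
    result = []
    previous = ""
    for line in lines:
        tokens = line.split()
        if tokens and tokens[0] in ABBREVIATIONS and previous:
            line = previous + " " + line
        elif tokens:
            previous = headword(line)
        result.append(line)
    return result


print(copy_headwords(["албас a. cunning", "n. trick, ruse"]))
# ['албас a. cunning', 'албас n. trick, ruse']
```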

The final format for each entry looks similar to this:

word1, word2 abbrv1. abbrv2. abbrv3. definition1, definition2, definition3; definition4

Words and definitions are separated by either commas or semicolons. Abbreviations are separated by whitespace and end with a ".".
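
To make the format concrete, here is a small Python sketch that splits one such line into its three parts. It is a simplification for illustration: it treats any token ending in "." as an abbreviation, whereas the conversion script uses a fixed table of known abbreviations. `parse_entry` is a hypothetical name.

```python
import re

SPLIT_RE = re.compile(r"[;,]\s+")


def parse_entry(line):
    """Split a line into (words, abbreviations, definitions)."""
    tokens = line.split()
    # Index of the first token that looks like an abbreviation.
    start = next((i for i, t in enumerate(tokens) if t.endswith(".")), len(tokens))
    # The abbreviations are the consecutive run of such tokens.
    end = start
    while end < len(tokens) and tokens[end].endswith("."):
        end += 1
    words = SPLIT_RE.split(" ".join(tokens[:start]))
    abbrvs = tokens[start:end]
    definitions = SPLIT_RE.split(" ".join(tokens[end:]))
    return words, abbrvs, definitions


print(parse_entry("аайы a. each, every"))
# (['аайы'], ['a.'], ['each', 'every'])
```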

We pass the filename of our dictionary file to this script:

#!/usr/bin/python3

import fileinput
import itertools
import re
import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9

BRACKETS_RE = re.compile(r'(\(.+?\)|\[.+?\])')
SPLIT_RE = re.compile(r'[;,]\s+')

ABBRVS = {
    'a.': ['adj'],
    'adv.': ['adv'],
    # arch. archaic
    # cf. see also
    # comp. computer-related
    # conv. converb, modifying verb
    # dial. dialect
    'det.': ['det'],
    # Evk. Evenki
    'exc.': ['ij'],
    'int.': ['itg'],
    # Mongo. Mongolian
    'n.': ['n'],
    'num.': ['det', 'qnt'],
    # ono. onomatopoeia
    'pl.': ['pl'],
    'pp.': ['post'],
    'pro.': ['prn'],
    # Russ. Russian
    'v.': ['v', 'TD']
}

class Entry(object):
    def __find_brackets(self, line):
        # return every bracketed span on the line, not just the first one
        return BRACKETS_RE.findall(line)

    def __split(self, line):
        return SPLIT_RE.split(line)

    def __init__(self, line):
        tags = line.split()

        self.words = []
        self.abbrvs = []
        self.meanings = []

        found_abbrv = False
        found_conv = False
        for tag in tags:
            if tag in ABBRVS: # abbreviations
                found_abbrv = True
                self.abbrvs.extend(ABBRVS[tag])
                continue
            elif tag == "conv.":
                found_abbrv = True
                found_conv = True
                self.abbrvs.append("vaux")
                continue

            if not found_abbrv: # headwords
                self.words.append(tag)
            else: # translations
                self.meanings.append(tag)

        # if "cf" appears among the words, drop it and everything after it
        for i, word in enumerate(self.words):
            if word == "cf":
                self.words = self.words[:i]

        if found_conv:
            # for converbs keep only the last word; guard against a missing headword
            self.words = self.words[-1] if self.words else ""
        else:
            self.words = " ".join(self.words)
        self.meanings = " ".join(self.meanings)

        # lines without a known abbreviation cannot be parsed;
        # they only appear in the output as XML comments
        if not self.abbrvs:
            self.words = None
            self.abbrvs = None
            self.meanings = None
            return

        # remove the brackets
        brackets = self.__find_brackets(self.words)
        if brackets:
            for bracket in brackets:
                self.words = self.words.replace(bracket, "")

        brackets = self.__find_brackets(self.meanings)
        if brackets:
            for bracket in brackets:
                self.meanings = self.meanings.replace(bracket, "")

        # strip the infinitive marker "to" (whole word only, so "tomorrow" is left alone)
        self.meanings = re.sub(r'\bto\b', '', self.meanings)

        # split up meanings and entries
        self.words = [x.strip() for x in self.__split(self.words)]
        self.meanings = [x.strip() for x in self.__split(self.meanings)]

def insert_blanks(element, line):
    words = line.split()
    if not words:
        return
    element.text = words[0]
    element.tail = None
    blank = None
    for i in words[1:]:
        blank = ET.SubElement(element, 'b')
        blank.tail = i

def main():
    dictionary = ET.Element("dictionary")
    pardefs = ET.SubElement(dictionary, "pardefs")

    for line in fileinput.input():
        line = line.strip()
        if not line:
            continue

        comment = ET.Comment(text=line)
        pardefs.append(comment)

        entry = Entry(line)
        if not (entry.words and entry.abbrvs and entry.meanings):
            continue

        for word, meaning in itertools.product(entry.words, entry.meanings):
            e = ET.SubElement(pardefs, "e")
            e.set('r', 'LR')

            p = ET.SubElement(e, 'p')

            ## add word and meaning
            left = ET.SubElement(p, 'l')
            insert_blanks(left, word)

            right = ET.SubElement(p, 'r')
            insert_blanks(right, meaning)

            # add abbreviations
            for abbrv in entry.abbrvs:
                s = ET.Element('s')
                s.set('n', abbrv)
                left.append(s)
                right.append(s)
    ET.dump(dictionary)

main()

This will give us an XML dump of the dictionary, converted to the lttoolbox format. We format the XML file as shown here:

$ xmllint --format --encode utf8 file.xml > file.dix

The `--encode utf8` option prevents `xmllint` from escaping our Unicode characters.

The final file format looks like this:

<?xml version="1.0" encoding="utf8"?>
<dictionary>
  <pardefs>
    <!--аа exc. Oh! See!-->
    <e r="LR">
      <p>
        <l>аа<s n="ij"/></l>
        <r>Oh!<b/>See!<s n="ij"/></r>
      </p>
    </e>
    <!--ааҕыс v. to reckon with-->
    <e r="LR">
      <p>
        <l>ааҕыс<s n="v"/><s n="TD"/></l>
        <r>reckon<b/>with<s n="v"/><s n="TD"/></r>
      </p>
    </e>
    <!--аайы a. each, every-->
    <e r="LR">
      <p>
        <l>аайы<s n="adj"/></l>
        <r>each<s n="adj"/></r>
      </p>
    </e>
...
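
The way multi-word sides such as "reckon with" end up as text interleaved with blank elements can be seen in isolation with the following sketch. It mirrors what `insert_blanks` and the symbol loop in the script do, but `build_side` is a hypothetical helper written only for this illustration.

```python
import xml.etree.ElementTree as ET


def build_side(tag, phrase, symbols):
    """Build an <l> or <r> element: words joined by <b/>, followed by <s n="..."/>."""
    element = ET.Element(tag)
    words = phrase.split()
    if words:
        # The first word is the element's own text.
        element.text = words[0]
        # Every further word is the tail of a <b/> element placed before it.
        for word in words[1:]:
            blank = ET.SubElement(element, "b")
            blank.tail = word
    for name in symbols:
        ET.SubElement(element, "s", n=name)
    return element


right = build_side("r", "reckon with", ["v", "TD"])
print(ET.tostring(right, encoding="unicode"))
```

ElementTree writes empty elements with a space before the slash (`<b />`); the `xmllint` step shown earlier rewrites them in the `<b/>` form seen in the final file.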