Difference between revisions of "Sentence segmenting"

Revision as of 00:23, 2 January 2013

This page gives a review and usage instructions for some sentence-segmenting tools.

Tools

NLTK Punkt

You will need to install NLTK and NLTK data. At the time of this revision, both supported only Python 2.6-2.7; NLTK 3.0 and later support Python 3.

Prerequisites & installing NLTK

If you do not have these packages, please install them in the following order:

  1. Setuptools: sudo apt-get install python-setuptools
  2. Pip: sudo easy_install pip
  3. (optional) Numpy: sudo pip install -U numpy
  4. PyYAML and NLTK: sudo pip install -U pyyaml nltk

To be sure that it has been correctly installed, type python. Then, type import nltk. Nothing should be output. Type quit() to quit.
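
The same check can be scripted. This is a minimal sketch that uses only the standard library; the helper name missing_packages is made up for this example:

```python
import importlib


def missing_packages(names):
    """Return the names in `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            # The package is not installed (or not visible to this Python).
            missing.append(name)
    return missing
```

Once the packages above are installed, missing_packages(["yaml", "nltk"]) should return an empty list (PyYAML is imported under the name yaml).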

Installing NLTK data

NLTK data is also commonly called nltk.data.

Once you have installed NLTK, you can run the NLTK Downloader to install nltk.data.

  1. Type python to start the Python interpreter.
  2. Type import nltk.
  3. Type nltk.download() to open the NLTK Downloader.
    1. To download the sentence tokenisation package, nltk.tokenize.punkt, type d punkt.
    2. If you want to get everything (optional but recommended), type d all.
    3. For English, you also need the "corpora" package.
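
If you prefer a script to the interactive downloader, the sketch below downloads a package only when it is missing. The helper name ensure_resource is hypothetical; the lookup and download functions are passed in as arguments, and with NLTK they would be nltk.data.find and nltk.download:

```python
def ensure_resource(resource_path, package_id, find, download):
    """Download `package_id` only if `resource_path` cannot be found.

    `find` must raise LookupError when the resource is absent (as
    nltk.data.find does); `download` fetches a package by its id (as
    nltk.download does). Returns True if a download was triggered.
    """
    try:
        find(resource_path)
    except LookupError:
        download(package_id)
        return True
    return False
```

With NLTK imported, a call could look like ensure_resource('tokenizers/punkt', 'punkt', nltk.data.find, nltk.download). The package id punkt is the one used with d punkt above; other NLTK releases may name their packages differently.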

Training

  1. Create a new Python file.
  2. Type import nltk to import the NLTK package.
  3. You also need to import everything under NLTK (very important!): from nltk import *

Scenario 1: If you are tokenising a language that uses different punctuation from English (e.g. Armenian), then you need to set the PunktLanguageVars:

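class LanguagePunktVars(nltk.tokenize.punkt.PunktLanguageVars):
    sent_end_chars = ('։', '՞', '՜', '.')

language_punkt_vars = LanguagePunktVars()

List the characters that are likely to end sentences, each in single quotation marks, in the sent_end_chars tuple (second line). sent_end_chars is an attribute, not a function, and subclassing keeps the change local to your language instead of altering PunktLanguageVars for every tokenizer. Depending on your language, you can rename LanguagePunktVars and language_punkt_vars to something more appropriate.

Pass the PunktLanguageVars object to PunktTrainer as the lang_vars keyword argument (the second positional argument of PunktTrainer is verbose, not the language variables), and pass it to PunktSentenceTokenizer as well:

trainer = nltk.tokenize.punkt.PunktTrainer(text, lang_vars=language_punkt_vars)
params = trainer.get_params()
sbd = nltk.tokenize.punkt.PunktSentenceTokenizer(params, lang_vars=language_punkt_vars)
  • The first line creates a PunktTrainer object.
  • The variable text is the corpus that you want to train the trainer on.
  • language_punkt_vars is your PunktLanguageVars object.
  • PunktSentenceTokenizer accepts your parameters (params), which we'll also use later. Because this sbd carries your language variables, reuse it in the "Usage" section instead of creating a new tokenizer there.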

Scenario 2: If the language uses the same punctuation as English, you create the trainer slightly differently:

trainer = nltk.tokenize.punkt.PunktTrainer(text) 
trainer.INCLUDE_ALL_COLLOCS = True 
trainer.INCLUDE_ABBREV_COLLOCS = True
params = trainer.get_params()
  • The first line creates a PunktTrainer object.
  • text is the corpus (a string) you want to train the trainer on.
  • INCLUDE_ALL_COLLOCS makes the trainer treat every word pair whose first word ends in a period as a potential collocation.
  • INCLUDE_ABBREV_COLLOCS makes the trainer treat word pairs whose first word is an abbreviation as potential collocations. INCLUDE_ALL_COLLOCS overrides it, so the order of the two assignments does not matter.
  • We'll use the params variable later.

For every language, assuming that the variable text is the corpus you want to train your trainer on, do:

trainer.train(text)
  • You must do this step to get a usable trainer object.

Now your trainer has been trained!
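
Putting both scenarios together, the training steps might look like the sketch below. It assumes NLTK is installed; train_tokenizer and CustomLanguageVars are names invented for this example. The trainer is created empty here so that the INCLUDE_* flags are already set when training starts, which is a choice of this sketch rather than something the steps above require.

```python
def train_tokenizer(text, sent_end_chars=None):
    """Train Punkt on `text` and return a sentence tokenizer.

    `sent_end_chars` is an optional sequence of sentence-ending characters
    (Scenario 1). Leave it as None to keep the English defaults (Scenario 2).
    """
    # Imported inside the function so this file still loads without NLTK.
    from nltk.tokenize.punkt import (
        PunktLanguageVars,
        PunktSentenceTokenizer,
        PunktTrainer,
    )

    if sent_end_chars is None:
        lang_vars = PunktLanguageVars()
    else:
        chars = tuple(sent_end_chars)

        # Subclass so the custom characters do not leak into other tokenizers.
        class CustomLanguageVars(PunktLanguageVars):
            sent_end_chars = chars

        lang_vars = CustomLanguageVars()

    trainer = PunktTrainer(lang_vars=lang_vars)
    trainer.INCLUDE_ALL_COLLOCS = True
    trainer.INCLUDE_ABBREV_COLLOCS = True
    trainer.train(text)  # train() also finalizes the statistics by default

    # The tokenizer needs both the learned parameters and the language variables.
    return PunktSentenceTokenizer(trainer.get_params(), lang_vars=lang_vars)
```

The returned object can be used in place of sbd in the "Usage" section.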

Usage

If you want to tokenize English sentences, please skip ahead to "Tokenize English sentences".

For non-English languages:

First, define a sentence boundary detector (if you followed Scenario 1, sbd already exists and you can skip this line):

sbd = PunktSentenceTokenizer(params)

This is where we use the params variable that was created in the "Training" section.

Now, we will use the sentence boundary detector to split our text into sentences:

for sentence in sbd.sentences_from_text(text, realign_boundaries=True):
  • text is the string of text that you want to tokenize

Inside the for loop, you can do anything with the newly tokenized sentence (stored in the sentence variable). For example, you could print each one, followed by a blank line:

    print(sentence + "\n")
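
The loop can also be wrapped in a small helper that returns the sentences one per line, ready to be written to a file. This is only a sketch: sentences_to_lines is a made-up name, and it works with any object that offers sentences_from_text the way sbd does.

```python
def sentences_to_lines(tokenizer, text):
    """Return the sentences found in `text`, one per line, as a single string."""
    sentences = tokenizer.sentences_from_text(text, realign_boundaries=True)
    # Drop surrounding whitespace and skip fragments that are empty afterwards.
    cleaned = [s.strip() for s in sentences if s.strip()]
    return "".join(s + "\n" for s in cleaned)
```

For example, sentences_to_lines(sbd, text) gives a string that you can pass straight to a file object's write method.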

Now you're finished!

Tokenize English sentences

Load the English pickle (provided):

sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')

By default, that is the path to english.pickle. Change it if you installed nltk_data to a different directory when you downloaded it.

Now, you can tokenize the sentences:

sentences = sent_detector.tokenize(text.strip())
  • text is the string of text that you want to tokenize
  • sentences (the created variable) is a list of the tokenized sentences (each value is a separate sentence)
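
The two steps can be combined into one function. A minimal sketch, assuming the pickle path shown above is valid for your NLTK installation (other releases may package the Punkt data differently); split_english is a hypothetical name:

```python
def split_english(text, detector=None):
    """Split English `text` into a list of sentences.

    Pass an already loaded `detector` to avoid reloading the model on
    every call; otherwise the provided English pickle is loaded.
    """
    if detector is None:
        # Imported here so the function is importable without NLTK.
        import nltk.data
        detector = nltk.data.load('tokenizers/punkt/english.pickle')
    return detector.tokenize(text.strip())
```

Loading the pickle is the slow part, so when you split many texts it is cheaper to load sent_detector once and pass it in as detector.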

splitta

Splitta requires Python 2.5 or later in the 2.x series. It doesn't work with Python 3.

Training

Unfortunately, it only works for English. You can find the developer's email address here.

Usage

You can download the latest version of splitta here. Extract the files to a directory.

Download the SVM Light binaries from http://svmlight.joachims.org/ and extract all files to any directory. You will need to know the file path of that directory.

Open "sbd.py" (in the splitta directory). Scroll to where it says "## globals". You will see these two lines:

SVM_LEARN = '/u/dgillick/tools/svm_light/svm_learn'
SVM_CLASSIFY = '/u/dgillick/tools/svm_light/svm_classify'

Replace the two file paths with the location of the files from SVM Light. For example, if you extracted them to the same directory where splitta is, then it would be:

SVM_LEARN = 'svm_learn'
SVM_CLASSIFY = 'svm_classify'
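
Before running splitta, it can help to confirm that both binaries really are where those two settings point. The sketch below is not part of splitta; svm_light_paths and missing_binaries are hypothetical helpers, and the only assumption taken from above is that the binaries are named svm_learn and svm_classify:

```python
import os


def svm_light_paths(directory):
    """Return the (svm_learn, svm_classify) paths inside `directory`."""
    return (os.path.join(directory, "svm_learn"),
            os.path.join(directory, "svm_classify"))


def missing_binaries(directory):
    """List the SVM Light binaries that are absent or not executable."""
    return [path for path in svm_light_paths(directory)
            if not (os.path.isfile(path) and os.access(path, os.X_OK))]
```

An empty list from missing_binaries means both files exist and are executable; the paths from svm_light_paths are the values to put in SVM_LEARN and SVM_CLASSIFY.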

Open Terminal and cd to the directory where splitta is. The simplest command to tokenize paragraphs into sentences is:

python sbd.py -m model_nb corpus.txt

This assumes that corpus.txt is your corpus file. Splitta provides a sample English corpus, called "sample.txt", for you to test with. The command prints:

loading model from [model_nb/]... done!
reading [corpus.txt]
featurising... done!
NB classifying... done!

Afterwards, the tokenized sentences are printed to the screen.

You can specify an output file (the sentences will be written to the file) using the -o option:

python sbd.py -m model_nb -o output.txt corpus.txt

MxTerminator

Get `jmx.tar.gz` from here. Extract it to an empty directory. Edit your CLASSPATH to include mxpost.jar:

export CLASSPATH=/usr/home/<yourname>/<yourdir>/mxpost.jar

Replace `<yourname>` with your user name and `<yourdir>` with the directory that contains mxpost.jar.

The general instructions (rather minimal) are in the file MXTERMINATOR.html in the extracted jmx directory.

Training

Create an empty directory. This will be the directory for your project. Place a data file in it; the file should contain many sentences in the language, one per line. Then run this command, where traindata is that data file:

./trainmxterminator projectdir traindata

Note: According to the developer, it should work with UTF-8.
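
If your sentences are in a Python list, a data file in that one-sentence-per-line layout can be produced as sketched below. The helper name write_training_data is made up, and the sketch assumes Python 3 (for the encoding argument of open):

```python
def write_training_data(sentences, path):
    """Write `sentences` to `path`, one per line, encoded as UTF-8.

    Returns the number of sentences written.
    """
    written = 0
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        for sentence in sentences:
            # Collapse internal whitespace so a sentence never spans two lines.
            line = " ".join(sentence.split())
            if line:
                handle.write(line + "\n")
                written += 1
    return written
```

The resulting file can then be passed as traindata in the command above.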

Usage

Run the command below. Replace projectdir with the directory of your project (a sample project, eos.project, is included) and textfile with the raw text that you want to tokenize.

./mxterminator projectdir < textfile

Sentrick

WILL BE UPDATED!

Home
FAQ (mostly troubleshooting)
New usage page

Training

Get the developer's version

git clone git://sentrick.git.sourceforge.net/gitroot/sentrick/sentrick

Run ant clean dist. If it does not work, go to common-targets.xml (under ant) and comment out/delete everything between <target name="test" depends="compile.tests" description="run junit tests" > and </target> (advice from developer; it helps).

If it complains about encoding, add the attribute encoding="MacRoman" to the two <javac> tags in common-targets.xml.

It will generate a folder called dist. In terminal, type cd dist to get to the dist directory.

You need some input data with segmented sentences separated by a newline. `editor.sh` and `snippetCollector.sh` (two minimalistic GUI tools) will save the boundary positions (punctuation that ends sentences) in a .bps file:

Type sh editor.sh to open the Sentrick Sentence Boundary Editor (aka editor.sh).

In the left panel, navigate to the input data aforementioned. You can change the encoding (not recommended) or how to detect the boundaries (which sentence boundary detector/sbd to use) using the two dropdown menus on the left. Then, press the "Save Boundary Positions" button on the right. It will generate a .bps file with the same name as your .txt input file in the same directory.

Put your .bps and .txt files into a directory.

snippetCollector accomplishes the same thing, but it assumes that you do not have an input file. You can input the text into the textbox on the left and press Segment. Then, press "Save txt,bps as...". It does not let you change the encoding.

Use tdgen.sh (already-built) to create four .pl files.

This is the usage:

tdgen [txt,bps root directory] [encoding] [SbdProvider (for resources)] [training data output directory] <token id prefix>

Here's an example command:

sh tdgen.sh dir utf-8 de.denkselbst.sentrick.sbd.NoSbdProvider ../out

dir is the directory that contains the .txt and .bps files, de.denkselbst.sentrick.sbd.NoSbdProvider is the SbdProvider (the sentence boundary detector to use), and ../out is the output directory (where your four .pl files will end up).

Usage

Download the latest version of Sentrick from http://sourceforge.net/projects/sentrick/files/latest/download. Currently, Sentrick comes with sentence tokenisation for English and German.

Extract the files into a directory. Sentrick needs two arguments to run: the input file that contains paragraphs that need to be tokenized (-i) and the file it will output the tokenized sentences to (-o). An example command is:

./sentrick/bin/sentrick.sh -i input.txt -o output.txt

You must run sentrick.sh from outside the bin directory.

By default, the encodings of both files will be UTF-8. Optionally, you can declare the encodings of the two files using -ie (for the input file) and -oe (for the output file). For example, if you want the input file to be encoded as UCS-2 and the output file to be encoded as UTF-32, you could use this command:

./sentrick/bin/sentrick.sh -i input.txt -o output.txt -ie UCS-2 -oe UTF-32

If you want to tell Sentrick what language the input file is in (it defaults to English), then you can use the -l argument. Currently, only English (en) and German (de) are supported.
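
When Sentrick is called from a script, the options described above can be assembled programmatically. The sketch below only builds the argument list and uses just the options mentioned on this page (-i, -o, -ie, -oe, -l); build_sentrick_command is a hypothetical helper, not part of Sentrick:

```python
def build_sentrick_command(input_path, output_path, input_encoding=None,
                           output_encoding=None, language=None,
                           script="./sentrick/bin/sentrick.sh"):
    """Return the sentrick.sh command line as a list of arguments."""
    command = [script, "-i", input_path, "-o", output_path]
    if input_encoding:
        command += ["-ie", input_encoding]
    if output_encoding:
        command += ["-oe", output_encoding]
    if language:
        command += ["-l", language]
    return command
```

The returned list can be handed to subprocess.call, run from the directory that contains the sentrick folder so that the default script path resolves.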