Sentence segmenting
This page reviews some sentence-segmenting tools and gives usage instructions for them.

Keywords: sentence segmentation, sentence tokenization, sentence tokenisation
Tool | Author/article | Method | Language
NLTK Punkt | Kiss & Strunk 2006 (https://www.mitpressjournals.org/doi/abs/10.1162/coli.2006.32.4.485) | Unsupervised | Python 2 & 3
splitta | Gillick 2009 (https://dl.acm.org/citation.cfm?id=1620920) | Supervised, SVM | Python 2
MxTerminator | Reynar & Ratnaparkhi 1997 (https://arxiv.org/pdf/cmp-lg/9704002.pdf) | Supervised, maxent | Java
Sentrick | Patrick Tschorn | seems supervised (https://web.archive.org/web/20120620210553/http://www.denkselbst.de:80/sentrick/index.html) | Java and Prolog(!)
Tools

NLTK Punkt
You will need to install NLTK and NLTK data. Unfortunately, they both only support Python versions 2.6-2.7. If you are using Python 3, you can run getnltk.py (http://wiki.apertium.org/wiki/Getnltk.py) from inside your Python 3 file and it will return the tokenised text.
- Prerequisites & installing NLTK
If you do not have these packages, please install them in the following order:
- Setuptools: sudo apt-get install python-setuptools
- Pip: sudo easy_install pip
- (optional) Numpy: sudo pip install -U numpy
- PyYAML and NLTK: sudo pip install -U pyyaml nltk

To be sure that it has been correctly installed, type python. Then, type import nltk. Nothing should be output. Type quit() to quit.
- Installing NLTK data
NLTK data is also commonly called nltk.data.
Once you have installed NLTK, you can run the NLTK Downloader to install nltk.data.
- Type python to start the Python interpreter.
- Type import nltk.
- Type nltk.download() to open the NLTK Downloader.
  - To download the sentence tokenisation package, nltk.tokenize.punkt, type d punkt.
  - If you want to get everything (optional but recommended), type d all.
  - For English, you also need the "corpora" package.
- Training
- Create a new Python file.
- Type import nltk to import the NLTK package.
- You also need to import everything under NLTK (very important!): from nltk import *
Scenario 1: If you are tokenising a language that uses different punctuation than English (e.g. Armenian), then you need to set the PunktLanguageVars:

language_punkt_vars = nltk.tokenize.punkt.PunktLanguageVars()
language_punkt_vars.sent_end_chars = ('։', '՞', '՜', '.')
Put the characters that are likely to end sentences, each in single quotation marks, in the sent_end_chars tuple above (second line). Depending on your language, you can change the variable name language_punkt_vars to whatever is more appropriate.
Now it's time to train the trainer. Pass the PunktLanguageVars variable to PunktTrainer as the lang_vars argument, like this:

trainer = nltk.tokenize.punkt.PunktTrainer(traindata, lang_vars=language_punkt_vars)
params = trainer.get_params()
sbd = nltk.tokenize.punkt.PunktSentenceTokenizer(params)
- The first line is to create a PunktTrainer object.
- The variable traindata is the corpus that you want to train the trainer on.
- language_punkt_vars is your PunktLanguageVars variable.
- PunktSentenceTokenizer accepts your parameters (params), which we'll also use later.
Scenario 2: If the language uses the same punctuation as English, you create the trainer slightly differently:
trainer = nltk.tokenize.punkt.PunktTrainer(traindata)
trainer.INCLUDE_ALL_COLLOCS = True
trainer.INCLUDE_ABBREV_COLLOCS = True
params = trainer.get_params()
- The first line is to create a PunktTrainer object.
- traindata is the (string) corpus you want to train the trainer on.
- INCLUDE_ALL_COLLOCS makes the trainer look for and remember abbreviations and initialisms.
- INCLUDE_ABBREV_COLLOCS makes the trainer look for and remember word pairs where the first word is an abbreviation. It has to be set after INCLUDE_ALL_COLLOCS.
- The documentation for the two aforementioned booleans can be found at http://nltk.org/_modules/nltk/tokenize/punkt.html.
- We'll use the params variable later.
For every language, assuming that the variable traindata is the corpus you want to train your trainer on, do:
trainer.train(traindata)
- You must do this step to get a usable trainer object.
Now your trainer has been trained!
- Usage
Scenario 1: Languages other than English:
First, we must define a sentence boundary detector:
sbd = PunktSentenceTokenizer(params)
That is where we use the params variable that was created in the "Training" section.
Now, we will use the tokenizer to tokenize our text:
for sentence in sbd.sentences_from_text(tobetokenized, realign_boundaries=True):
- tobetokenized is the string of text that you want to tokenize.
Inside the for loop, you can do anything with the newly tokenized sentence (stored in the sentence variable). For example, you could print them, separated by a newline:
print sentence + "\n"
Now you're finished!
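Putting the Scenario 1 training and usage steps together, a minimal end-to-end sketch might look like the one below. The file names, the variable names, and the Armenian punctuation are only placeholders, and the lang_vars keyword argument is assumed to be available in your NLTK version; adjust as needed:

# -*- coding: utf-8 -*-
import codecs
import nltk

# Read the training corpus and the text to be tokenized (placeholder file names).
traindata = codecs.open('corpus.txt', 'r', 'utf-8').read()
tobetokenized = codecs.open('input.txt', 'r', 'utf-8').read()

# Sentence-final punctuation for the target language (Armenian shown as an example).
language_punkt_vars = nltk.tokenize.punkt.PunktLanguageVars()
language_punkt_vars.sent_end_chars = (u'։', u'՞', u'՜', u'.')

# Train Punkt on the corpus and build a tokenizer from the learned parameters.
trainer = nltk.tokenize.punkt.PunktTrainer(lang_vars=language_punkt_vars)
trainer.train(traindata)
sbd = nltk.tokenize.punkt.PunktSentenceTokenizer(trainer.get_params(),
                                                 lang_vars=language_punkt_vars)

# Print one sentence per line.
for sentence in sbd.sentences_from_text(tobetokenized, realign_boundaries=True):
    print sentence.encode('utf-8')

For a Scenario 2 language, you would drop the lang_vars arguments and set INCLUDE_ALL_COLLOCS and INCLUDE_ABBREV_COLLOCS on the trainer as shown above.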
Scenario 2: Tokenize English sentences
Load the English pickle (provided):
sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
By default, that is the path to english.pickle. Change it if you installed nltk_data to a different directory when you downloaded it.
Now, you can tokenize the sentences:
sentences = sent_detector.tokenize(tobetokenized.strip())
- tobetokenized is the string of text that you want to tokenize.
- sentences (the created variable) is a list of the tokenized sentences (each value is a separate sentence).
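As a compact recap of the English workflow (the file name input.txt is a placeholder):

import nltk

# Load the pre-trained English Punkt model shipped with nltk_data.
sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')

# Tokenize a raw-text file into a list of sentences, one element per sentence.
tobetokenized = open('input.txt').read()
sentences = sent_detector.tokenize(tobetokenized.strip())
for sentence in sentences:
    print sentence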
splitta
Splitta requires Python 2.5 or later; it does not work with Python 3.
- Training
Unfortunately, it only works for English. You can find the developer's email address here.
- Usage
You can download the latest version of splitta here. Extract the files to a directory.
Download the SVM Light binaries from http://svmlight.joachims.org/ and extract all files to any directory. You will need to know the file path of that directory.

Open "sbd.py" (in the splitta directory). Scroll to where it says "## globals". You will see these two lines:

SVM_LEARN = '/u/dgillick/tools/svm_light/svm_learn'
SVM_CLASSIFY = '/u/dgillick/tools/svm_light/svm_classify'

Replace the two file paths with the paths to wherever you put the files from SVM Light. For example, if you extracted them to the same directory where splitta is, then it would be:

SVM_LEARN = 'svm_learn'
SVM_CLASSIFY = 'svm_classify'
Open Terminal and cd to the directory where splitta is. The simplest command to tokenize paragraphs into sentences is:
python sbd.py -m model_nb corpus.txt
That is assuming that corpus.txt is your corpus file. Splitta provides a sample English corpus, called "sample.txt", for you to test with. It prints out:
loading model from [model_nb/]... done!
reading [corpus.txt]
featurising... done!
NB classifying... done!

Afterwards, the tokenized sentences are written to the screen. You can specify an output file (the sentences will be written to the file) using the -o option:

python sbd.py -m model_nb -o output.txt corpus.txt
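If you would rather drive splitta from a Python 2 script than from the shell, a minimal sketch using subprocess could look like this; it simply wraps the command shown above, and the file names are placeholders:

import subprocess

# Run splitta with its bundled Naive Bayes model and write the detected
# sentences to output.txt (placeholder file names).
subprocess.check_call(['python', 'sbd.py', '-m', 'model_nb',
                       '-o', 'output.txt', 'corpus.txt'])

# Read the result back in; one sentence per line is assumed here.
with open('output.txt') as f:
    sentences = [line.strip() for line in f if line.strip()]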
MxTerminator
Get `jmx.tar.gz` from here. Extract it to an empty directory. Edit your CLASSPATH to include mxpost.jar:
export CLASSPATH=/usr/home/<yourname>/<yourdir>/mxpost.jar
Replace `<yourname>` with your user name and `<yourdir>` with the directory where you extracted mxpost.jar.

The general instructions (too minimalistic to be of much help) are in the MXTERMINATOR.html file that ships with the jmx distribution.
- Training
Create an empty directory. This will be the directory for your project. In the directory, you must place a data file containing many sentences in the language, separated by newlines. Then, run this command (traindata is the aforementioned file):
./trainmxterminator projectdir traindata
Note: According to the developer, it should work with UTF-8 input.
- Usage
Run the command below. Replace projectdir with the directory of your project (a sample project, eos.project, is included) and textfile with the raw text that you want to tokenize.
./mxterminator projectdir < textfile
(Note: these are instructions from the readme and from developer consultation, not tested yet)
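Putting the steps above together, a full session might look roughly like this. The project and file names are placeholders, the final redirect assumes the segmented text is written to standard output, and the note above about these instructions being untested applies here too:

# Make mxpost.jar visible to Java (adjust the path to where you extracted jmx.tar.gz).
export CLASSPATH=/usr/home/<yourname>/<yourdir>/mxpost.jar

# Train a model in an empty project directory from a newline-separated sentence file.
mkdir myproject
./trainmxterminator myproject traindata

# Segment raw text with the trained model, keeping the result in sentences.txt.
./mxterminator myproject < textfile > sentences.txt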
Sentrick
Home: http://www.denkselbst.de/sentrick2/
FAQ (mostly troubleshooting)
New usage page
- Training
Good luck :)
Get the developer's version
git clone git://sentrick.git.sourceforge.net/gitroot/sentrick/sentrick
Run ant clean dist. If it does not work, go to common-targets.xml (under ant) and comment out/delete everything between <target name="test" depends="compile.tests" description="run junit tests" > and </target> (advice from the developer; it helps).

If it complains about encoding, add the attribute encoding="MacRoman" to the two <javac> tags in common-targets.xml.
It will generate a folder called dist. In a terminal, type cd dist to get to the dist directory.

You need some input data with segmented sentences separated by a newline. editor.sh and snippetCollector.sh (two minimalistic GUI tools) will save the boundary positions (punctuation that ends sentences) in a .bps file:
Type sh editor.sh to open the Sentrick Sentence Boundary Editor (aka editor.sh).
In the left panel, navigate to the input data aforementioned. You can change the encoding (not recommended) or how to detect the boundaries (which sentence boundary detector/sbd to use) using the two dropdown menus on the left. Then, press the "Save Boundary Positions" button on the right. It will generate a .bps file with the same name as your .txt input file in the same directory.
Put your .bps and .txt files into a directory.
snippetCollector accomplishes the same thing; it just assumes that you do not have an input file. You can enter the text into the textbox on the left and press Segment. Then, press "Save txt,bps as...". It does not let you change the encoding.
Use tdgen.sh (already-built) to create four .pl files.
This is the usage:
tdgen [txt,bps root directory] [encoding] [SbdProvider (for resources)] [training data output directory] <token id prefix>
Here's an example command:
sh tdgen.sh dir utf-8 de.denkselbst.sentrick.sbd.NoSbdProvider ../out
dir is the directory of the .txt and the .bps files. de.denkselbst.sentrick.sbd.NoSbdProvider is the sbd. ../out is the output directory (where your 4 .pl files will end up).
This converts the files to Prolog files.
Next, cd to modules/niffler. Build by using:
ant clean dist -Darch=nachtigaller
If it gives an error like:
cannot find symbol : class Term
That means that it cannot find jpl.jar. Open nachtigaller.properties in modules/niffler and set swi.home to where you installed SWI-Prolog. Set swi.dyn to your processor architecture. Here is an example of how nachtigaller.properties may look:
swi.home=/usr/lib/swi-prolog
swi.dyn=amd64
If it still does not work, run ant clean dist -Darch=nachtigaller from the root folder of your Sentrick installation.
After it builds, there will be a file called learnVetoRules.sh in modules/niffler/scripts.

The script needs to be passed the four .pl files generated by tdgen.sh. This may take a while.
There will be some acceptedByTeacher.* files. acceptedByTeacher.html contains the segmented sentences. The .pl version contains all the generated rules as plain-text Prolog. The .ser file can be used to bootstrap the learning cycle so you don't have to make Sentrick relearn everything.
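As a recap, the training pipeline above runs roughly in this order. Directory names are placeholders and the exact learnVetoRules.sh invocation is not documented here, so the last command is only indicative:

# Get and build the developer's version of Sentrick.
git clone git://sentrick.git.sourceforge.net/gitroot/sentrick/sentrick
cd sentrick
ant clean dist
cd dist

# Mark sentence boundaries in your raw text; this writes a .bps file next to each .txt.
sh editor.sh

# Generate the four .pl training files from the directory holding the .txt/.bps pairs.
sh tdgen.sh dir utf-8 de.denkselbst.sentrick.sbd.NoSbdProvider ../out

# Build the rule learner and pass it the generated .pl files (indicative only).
cd ../modules/niffler
ant clean dist -Darch=nachtigaller
sh scripts/learnVetoRules.sh ../../out/*.pl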
- Usage
Download the latest version of Sentrick. Currently, Sentrick comes with sentence tokenisation for English and German.
Extract the files into a directory. Sentrick needs two arguments to run: the input file that contains paragraphs that need to be tokenized (-i) and the file it will output the tokenized sentences to (-o). An example command is:
./sentrick/bin/sentrick.sh -i input.txt -o output.txt
You must run sentrick.sh from outside the bin directory.
By default, the encodings of both files will be UTF-8. Optionally, you can declare the encodings of the two files using -ie (for the input file) and -oe (for the output file). For example, if you want the input file to be encoded as UCS-2 and the output file to be encoded as UTF-32, you could use this command:
./sentrick/bin/sentrick.sh -i input.txt -o output.txt -ie UCS-2 -oe UTF-32
If you want to tell Sentrick what language the input file is in (it defaults to English), then you can use the -l argument. Currently, only English (en) and German (de) are supported.
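For example, to segment a German input file you could combine the options like this (file names are placeholders):

./sentrick/bin/sentrick.sh -i input.txt -o output.txt -l de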