Difference between revisions of "Getnltk.py"

From Apertium

Revision as of 00:54, 2 January 2013

getnltk.py is located in /trunk/apertium-tools/scraper/getnltk.py. It was written by Daniel Huang. Its purpose is to make NLTK's Punkt sentence tokenizer usable from Python 3 code.
You can call getnltk.py from your Python 3 code as a separate process, like this:

py2output = subprocess.check_output(['python', 'getnltk.py', tosplit, lang])
  • tosplit is the text to be split into sentences
  • lang is the 2-letter or 3-letter language code. English, Russian, and Armenian are currently supported.

getnltk.py prints the sentences to standard output, which subprocess.check_output captures and returns in the variable py2output (as bytes in Python 3). xml2txt.py (in the same directory) uses getnltk.py.
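
The call above can be wrapped in a small helper that also decodes the captured output. This is only a sketch, not the code used in xml2txt.py: the function name run_getnltk and its command parameter are hypothetical, and it assumes that getnltk.py writes one sentence per line in UTF-8, which this page does not confirm.

```python
import subprocess


def run_getnltk(tosplit, lang, command=("python", "getnltk.py")):
    """Run getnltk.py in a separate process and return its output lines.

    Assumptions (not confirmed by the page): the script writes one
    sentence per line to standard output, encoded as UTF-8.
    """
    # check_output returns the child's standard output as bytes in Python 3
    # and raises CalledProcessError if the script exits with an error.
    raw = subprocess.check_output(list(command) + [tosplit, lang])

    # Decode the bytes and drop empty lines.
    text = raw.decode("utf-8")
    return [line for line in text.splitlines() if line.strip()]
```

Calling run_getnltk(tosplit, lang) with the same two arguments as above would then return a list of strings instead of raw bytes. The command parameter is there so that a different interpreter name or script path can be supplied if "python" does not point at the interpreter getnltk.py expects on a given system.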