Getnltk.py

Note: After Apertium's migration to GitHub, this tool is read-only on the SourceForge repository and does not exist on GitHub. If you are interested in migrating this tool to GitHub, see Migrating tools to GitHub.

getnltk.py is located in /trunk/apertium-tools/scraper/getnltk.py and was written by Daniel Huang. Its purpose is to make NLTK's Punkt sentence tokenizer usable from Python 3 code.
You can call getnltk.py from your Python 3 code like this:

py2output = subprocess.check_output(['python', 'getnltk.py', tosplit, lang])
  • tosplit is the text that will be tokenized into sentences
  • lang is the 2-letter or 3-letter language code. English, Russian, and Armenian are currently supported.

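The sentences printed by getnltk.py are captured in the variable py2output. In Python 3, subprocess.check_output returns a bytes object, so decode it before treating it as text. xml2txt.py (in the same directory) uses getnltk.py.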
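
The call described above can be wrapped in a small helper. This is only a minimal sketch: the function names build_command and split_sentences are hypothetical, and it assumes that getnltk.py writes UTF-8 text with one sentence per line, which this page does not confirm. Adjust the decoding and splitting to match the script's actual output.

```python
import subprocess


def build_command(tosplit, lang, script="getnltk.py", interpreter="python"):
    """Build the argument list used to invoke getnltk.py.

    The defaults mirror the call shown on this page; point `script` at the
    real location of getnltk.py if it is not in the working directory.
    """
    return [interpreter, script, tosplit, lang]


def split_sentences(tosplit, lang, run=subprocess.check_output):
    """Run getnltk.py and return its output as a list of strings.

    Assumption (not confirmed by this page): the script prints UTF-8 text
    with one sentence per line. `run` is injectable so the helper can be
    exercised without NLTK or getnltk.py being installed.
    """
    raw = run(build_command(tosplit, lang))
    text = raw.decode("utf-8")
    # Drop empty lines so trailing newlines do not produce empty entries.
    return [line for line in text.splitlines() if line.strip()]
```

For example, split_sentences(some_text, "en") would invoke the same command as the one-line call above, assuming the English language code is "en". Passing the arguments as a list, rather than as a single shell string, means the text to split needs no shell quoting.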
