Split text by punctuation

12 Apr 2020

Using `nltk`

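One-time setup: use the NLTK downloader to fetch the `punkt` model, which does sentence splitting (also called sentence tokenizing). Calling `nltk.download()` with no arguments opens an interactive menu; `nltk.download('punkt')` fetches the model directly.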

import nltk
# At the prompt: d) Download, then enter punkt as the identifier
nltk.download()

NLTK Downloader
---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------

Downloader> q

True

Then load the tokenizer and call it on a string:

import nltk.data

# Load the pre-trained English Punkt sentence tokenizer
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
s = "hello, world! It's me, X! Testing this tool."
# Returns a list with one string per sentence
tokenizer.tokenize(s)

['hello, world!', "It's me, X!", 'Testing this tool.']
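
For comparison, here is a rough sketch of the same split using only the standard library `re` module. This is not what Punkt does internally: it is a naive rule that breaks after `.`, `!` or `?` when whitespace follows, so it will also break after abbreviations such as "Dr." or "e.g.". The name `split_sentences` is made up for this example.

```python
import re

# Naive rule (an assumption for illustration, not the Punkt algorithm):
# a sentence ends at '.', '!' or '?' followed by whitespace.
_SENTENCE_END = re.compile(r'(?<=[.!?])\s+')

def split_sentences(text):
    """Return the sentences in text, keeping their closing punctuation."""
    # The lookbehind keeps the punctuation; only the whitespace is consumed.
    return [part for part in _SENTENCE_END.split(text.strip()) if part]

s = "hello, world! It's me, X! Testing this tool."
print(split_sentences(s))
```

On this sample it gives the same three sentences as the tokenizer above. On text with abbreviations, decimals or initials the two may differ, which is where a trained model like Punkt is likely the better choice.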
