Converts text to a list of words using a preconfigured maximum entropy tokenizer. Punctuation is removed.



Text column
Unhappily for his master, as well as himself, his curiosity drew him unconsciously farther off than he intended to go. At last, having seen the Parsee carnival wind away in the distance, he was turning his steps towards the station, when he happened to espy the splendid pagoda on Malabar Hill, and was seized with an irresistible desire to see its interior. He was quite ignorant that it is forbidden to Christians to enter certain Indian temples, and that even the faithful must not go in without first leaving their shoes outside the door. It may be said here that the wise policy of the British Government severely punishes a disregard of the practices of the native religions.
[Unhappily, for, his, master, as, well, as, himself, his, curiosity, drew, him, unconsciously, farther, off, than, he, intended, to, go, At, last, having, seen, the, Parsee, carnival, wind, away, in, the, distance, he, was, turning, his, steps, towards, the, station, when, he, happened, to, espy, the, splendid, pagoda, on, Malabar, Hill, and, was, seized, with, an, irresistible, desire, to, see, its, interior, He, was, quite, ignorant, that, it, is, forbidden, to, Christians, to, enter, certain, Indian, temples, and, that, even, the, faithful, must, not, go, in, without, first, leaving, their, shoes, outside, the, door, It, may, be, said, here, that, the, wise, policy, of, the, British, Government, severely, punishes, a, disregard, of, the, practices, of, the, native, religions]

Apostrophes for possessive words and contractions aren't removed.

I think that is Jason's baseball.
[I, think, that, is, Jason's, baseball]

We're going to get pizza after the game!
[We're, going, to, get, pizza, after, the, game]

Symbols standing alone like "@" and "&" are individual tokens in the resulting list.

My email address is services @
[My, email, address, is, services, @,]

Look over there, it is Jason & Eddie.
[Look, over, there, it, is, Jason, &, Eddie]

Symbols joined with or within a word result in a token containing that symbol.

My email address is
[My, email, address, is,]

Look over there, it is Jason&Eddie.
[Look, over, there, it, is, Jason&Eddie]

Not all symbols joined with words result in a single token.

The baseball cost $50.
[The, baseball, cost, $, 50]

The importance of #bigdata
[The, importance, of, #, bigdata]
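
The behaviors shown in the examples above can be approximated with a few fixed rules, which may help when reasoning about what to expect from the function. The following is only a rough sketch: it is a hand-written regular expression, not the maximum entropy model the function actually uses, and the names `TOKEN_PATTERN` and `approximate_tokenize` are hypothetical. It covers the apostrophe, "&", "$" and "#" examples only; the email address rows are left out.

```python
import re

# Hypothetical rule-based approximation of the documented examples.
# This is NOT the function's maximum entropy tokenizer.
TOKEN_PATTERN = re.compile(
    r"[A-Za-z0-9]+(?:['&@][A-Za-z0-9]+)*"  # a word, optionally joined by ', & or @
    r"|[@&$#]"                              # a symbol standing alone or leading a word
)


def approximate_tokenize(text: str) -> list:
    """Return the tokens in text; unmatched punctuation is dropped."""
    return TOKEN_PATTERN.findall(text)


if __name__ == "__main__":
    print(approximate_tokenize("Look over there, it is Jason & Eddie."))
    print(approximate_tokenize("The baseball cost $50."))
```

Because the real tokenizer is statistical, its output can differ from this sketch on inputs that are not shown in the examples.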
  • This function is not guaranteed to be 100% accurate.
  • The accuracy of results depends on both the training data used by the function and how comparable the input data is in quality to that training data.
  • In order to assess result accuracy, you need to conduct your own performance evaluation.
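
One possible way to run such an evaluation is to hand-label the expected tokens for a sample of rows and compare them with the function's output. The sketch below assumes the output has already been exported as Python lists; the names `token_scores` and `evaluate` and the choice of metrics (exact-match rate plus token-level precision and recall) are illustrative assumptions, not part of the product.

```python
from collections import Counter


def token_scores(predicted, expected):
    """Return (precision, recall) for one row, counting repeated tokens."""
    overlap = sum((Counter(predicted) & Counter(expected)).values())
    precision = overlap / len(predicted) if predicted else 0.0
    recall = overlap / len(expected) if expected else 0.0
    return precision, recall


def evaluate(rows):
    """Summarize accuracy over (predicted_tokens, expected_tokens) pairs."""
    rows = list(rows)
    if not rows:
        raise ValueError("at least one labeled row is required")
    scores = [token_scores(p, e) for p, e in rows]
    return {
        "exact_match_rate": sum(1 for p, e in rows if p == e) / len(rows),
        "mean_precision": sum(s[0] for s in scores) / len(rows),
        "mean_recall": sum(s[1] for s in scores) / len(rows),
    }


if __name__ == "__main__":
    # Hypothetical sample: the second row's prediction fails to split "#bigdata".
    sample = [
        (["The", "baseball", "cost", "$", "50"],
         ["The", "baseball", "cost", "$", "50"]),
        (["The", "importance", "of", "#bigdata"],
         ["The", "importance", "of", "#", "bigdata"]),
    ]
    print(evaluate(sample))
```

A larger and more representative labeled sample gives a more reliable estimate; the two rows here only illustrate the mechanics.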

