Tokenization API

What is Tokenization?

Tokenization consists of splitting a text into smaller units called tokens. What counts as a token depends on the tokenizer you're using: a token can be a word, a character, or a sub-word (the word "higher", for example, contains two sub-words: "high" and "er"). Punctuation marks like "!", ".", and ";" can be tokens too.
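
To see why this is less trivial than it sounds, here is a quick illustration in plain Python of the difference between naively splitting on whitespace and what a dedicated word-level tokenizer typically produces (the exact split depends on the tokenizer's rules):

```python
text = "Tokenization isn't trivial!"

# Whitespace splitting leaves punctuation and contractions glued to words:
print(text.split())
# ['Tokenization', "isn't", 'trivial!']

# A proper word-level tokenizer would separate them, roughly like this:
# ['Tokenization', 'is', "n't", 'trivial', '!']
```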

Tokenization is a fundamental step in almost every NLP operation. Because languages are structured differently, tokenization works differently in every language.

Why Use Tokenization?

You usually don't use tokenization alone but as the first step of a natural language processing pipeline. Because every piece of text passes through it, tokenization can significantly impact the performance of an NLP model, so the choice of tokenizer is important.

Tokenization with spaCy

spaCy is an excellent NLP framework that performs fast and accurate tokenization in many languages.
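
Here is a minimal sketch of tokenizing a sentence with spaCy. It assumes the small English model has been installed first (python -m spacy download en_core_web_sm):

```python
import spacy

# Load the small English pipeline (installed beforehand with:
# python -m spacy download en_core_web_sm).
nlp = spacy.load("en_core_web_sm")

doc = nlp("John is a Go developer at Google!")

# spaCy exposes the tokens directly on the Doc object.
print([token.text for token in doc])
# ['John', 'is', 'a', 'Go', 'developer', 'at', 'Google', '!']
```

If you only need tokenization, spacy.blank("en") gives you the tokenizer for a language without loading a full pipeline, which is faster to initialize.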

Tokenization Inference API

Building an inference API for tokenization is a useful step that can make NLP projects easier. With an API, you can automate your tokenization and call it from any programming language, not necessarily Python.

NLP Cloud's Tokenization API

NLP Cloud provides a tokenization API that lets you perform this operation out of the box, based on spaCy, with excellent performance. Tokenization is not very resource intensive, so the response time (latency) when performing it through the NLP Cloud API is very low. You can tokenize text in 15 different languages.
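
As a sketch, here is what calling such an endpoint could look like from Python with the requests library. The URL, model name, and response shape below are assumptions for illustration; refer to the tokenization documentation for the exact endpoint and payload:

```python
import requests

API_TOKEN = "<your API token>"  # placeholder: replace with your real token

# Hypothetical endpoint and model name, for illustration only; check the
# NLP Cloud tokenization docs for the exact URL and payload format.
response = requests.post(
    "https://api.nlpcloud.io/v1/en_core_web_lg/tokens",
    headers={"Authorization": f"Token {API_TOKEN}"},
    json={"text": "John is a Go developer at Google!"},
)
response.raise_for_status()

print(response.json())  # expected: the list of tokens extracted from the text
```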

For more details, see our documentation about tokenization.

As with all our NLP models, you can use tokenization for free, up to 3 API requests per minute.